Fixes Issue #79. Fixed repetition of text in to_text and to_html methods of the Document class

AaryanTR commented 4 months ago

Fixes #79

Here is the current implementation of the to_text method:

def to_text(self):
    """
    Returns text of a document by iterating through all the sections '\n'
    """
    text = ""
    for section in self.sections():
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text

The issue occurs when the document tree has the following (or similar) structure:

Section-1
├── Section-2
└── Section-3

The text of Section-2 appears in the output twice: once when the to_text method is called (recursively) for Section-1, and again when it is called for Section-2 itself. The text of Section-3 is duplicated in the same way.
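The duplication can be reproduced with a small, self-contained sketch. The Section and Document classes below are hypothetical stand-ins written for illustration, not the actual llmsherpa classes; the sketch assumes that sections() returns every section in the tree, nested ones included, which is what the behaviour described above implies.

```python
class Section:
    """Hypothetical stand-in for a document section (not the llmsherpa class)."""

    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        text = self.title
        if include_children:
            for child in self.children:
                # Direct children are always added; deeper levels only when recursing.
                text += "\n" + child.to_text(include_children=recurse, recurse=recurse)
        return text


class Document:
    """Hypothetical stand-in; sections() is assumed to list nested sections too."""

    def __init__(self, root_sections):
        self.root_sections = root_sections

    def sections(self):
        # Depth-first traversal that collects every section in the tree.
        found = []
        stack = list(reversed(self.root_sections))
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(reversed(node.children))
        return found

    def to_text(self):
        # Same loop as the implementation quoted above.
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text


doc = Document([Section("Section-1", [Section("Section-2"), Section("Section-3")])])
output = doc.to_text()
print(output)
```

In this sketch, Section-2 and Section-3 each appear twice in the printed output, while Section-1 appears once.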

To remove the duplicates, iterate over all the sections in the document and keep only the top sections, i.e., those that are not a child of any other section. Then concatenate the (recursive) text of the top sections.

Here is a summary of the changes made:

- The method _get_top_sections returns all the top sections in the document tree.
- The to_text and to_html methods accept a boolean parameter include_duplicates.
  - If include_duplicates is False, only the top sections are considered.
  - Otherwise, all the sections are considered.
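A minimal sketch of this approach, again using hypothetical stand-in classes rather than the real llmsherpa ones. The names _get_top_sections and include_duplicates come from the summary above; how top sections are detected here (by collecting the ids of all children) and the default value of include_duplicates are assumptions made for the example, and the actual commit may differ.

```python
class Section:
    """Hypothetical stand-in for a document section (not the llmsherpa class)."""

    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        text = self.title
        if include_children:
            for child in self.children:
                text += "\n" + child.to_text(include_children=recurse, recurse=recurse)
        return text


class Document:
    """Hypothetical stand-in; sections() is assumed to list nested sections too."""

    def __init__(self, root_sections):
        self.root_sections = root_sections

    def sections(self):
        # Depth-first list of every section, nested ones included.
        found, stack = [], list(reversed(self.root_sections))
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(reversed(node.children))
        return found

    def _get_top_sections(self):
        # A top section is one that is not a child of any other section.
        all_sections = self.sections()
        child_ids = {id(child) for s in all_sections for child in s.children}
        return [s for s in all_sections if id(s) not in child_ids]

    def to_text(self, include_duplicates=False):  # default value is an assumption
        sections = self.sections() if include_duplicates else self._get_top_sections()
        text = ""
        for section in sections:
            text += section.to_text(include_children=True, recurse=True) + "\n"
        return text


doc = Document([
    Section("Section-1", [Section("Section-2"), Section("Section-3")]),
    Section("Section-4"),
])
deduplicated = doc.to_text()
legacy = doc.to_text(include_duplicates=True)
```

With include_duplicates set to False, each section's text is emitted once, through its top-level ancestor; with True, the sketch falls back to the original behaviour and the nested sections are repeated.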

You can take a look at this fix in Google Colab here.

rodrigomeireles commented 3 months ago

@ansukla would you kindly review this PR? Or was this project abandoned?

ansukla commented 3 months ago

@rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.

ansukla commented 3 months ago

@AaryanTR - Did you run the test cases, and would it be possible for you to add a new one to test this change? Since this is a significant change to the two main functions, it would help to have test cases to better understand the behaviour and retain it through future changes.

AaryanTR commented 3 months ago

Hi @ansukla, thank you for your response. I have run the tests and they pass. I have also added test cases for the to_text and to_html methods in the latest commit.

Test before adding the new test cases:

Test after adding the new test cases:

AaryanTR commented 3 months ago

> @rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.

I would love to be a maintainer, but this is my work account. Is it possible for you to add my personal GitHub account @aaryan200 as a maintainer?

Thank you

nlmatics / llmsherpa

Fixes Issue #79. Fixed repetition of text in to_text and to_html methods of the Document class #83