nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Fixes Issue #79. Fixed repetition of text in to_text and to_html methods of the Document class #83

Closed AaryanTR closed 3 months ago

AaryanTR commented 4 months ago

Fixes #79

Here is the current implementation of the to_text method:

def to_text(self):
    """
    Returns text of a document by iterating through all the sections '\n'
    """
    text = ""
    for section in self.sections():
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text

The issue occurs when the document tree has the following (or similar) structure:

Section-1
├── Section-2
└── Section-3

The text of Section-2 is included in the output twice: once when the to_text method is called (recursively) for Section-1, and again when it is called for Section-2 itself. The text of Section-3 is duplicated in the same way.
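The duplication can be reproduced on a toy tree. This is only a sketch: `ToySection` and `all_sections` are hypothetical stand-ins written for illustration, not llmsherpa classes, and it assumes that the list being iterated contains every section in the tree, parents and children alike.

```python
class ToySection:
    """Hypothetical stand-in for a section node; not llmsherpa's class."""

    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def to_text(self):
        # Recursive text: this section's title followed by all descendants.
        return "\n".join([self.title] + [c.to_text() for c in self.children])


def all_sections(roots):
    """Flat, depth-first list of every section (assumed iteration order)."""
    flat = []
    for section in roots:
        flat.append(section)
        flat.extend(all_sections(section.children))
    return flat


section_2 = ToySection("Section-2")
section_3 = ToySection("Section-3")
section_1 = ToySection("Section-1", [section_2, section_3])

# Same shape as the loop in to_text: recurse from every section in the flat list.
naive_text = ""
for section in all_sections([section_1]):
    naive_text = naive_text + section.to_text() + "\n"

print(naive_text)
```

In this sketch Section-2 and Section-3 each appear once under Section-1 and once more on their own.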

To remove the duplicates, iterate over all the sections in the document and select only the top-level sections, i.e., those that are not children of any other section. Then concatenate the (recursive) text of these top-level sections.
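A minimal sketch of that idea on a toy tree follows. `ToySection`, `top_sections`, and `document_text` are hypothetical names chosen for illustration; the actual change in this PR operates on llmsherpa's own classes and may differ in detail.

```python
class ToySection:
    """Hypothetical stand-in for a section node; not llmsherpa's class."""

    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []

    def to_text(self):
        # Recursive text: this section's title followed by all descendants.
        return "\n".join([self.title] + [c.to_text() for c in self.children])


def top_sections(sections):
    """Keep only sections that are not a child of any other listed section."""
    child_ids = {id(child) for s in sections for child in s.children}
    return [s for s in sections if id(s) not in child_ids]


def document_text(sections):
    """Concatenate the recursive text of the top-level sections only."""
    return "".join(s.to_text() + "\n" for s in top_sections(sections))


section_2 = ToySection("Section-2")
section_3 = ToySection("Section-3")
section_1 = ToySection("Section-1", [section_2, section_3])

# Assumed input: a flat list holding every section in the tree.
flat = [section_1, section_2, section_3]
fixed_text = document_text(flat)
print(fixed_text)
```

Filtering on direct children is enough here: a deeper descendant is always a direct child of some other section in the flat list, so it is excluded as well.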

Here is a summary of the changes made:

You can take a look at this fix in Google Colab here.

rodrigomeireles commented 3 months ago

@ansukla would you kindly review this PR? Or was this project abandoned?

ansukla commented 3 months ago

@rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.

ansukla commented 3 months ago

@AaryanTR - Did you run the test cases, and would it be possible for you to add a new one to test this change? Since this is a significant change to the two main functions, it would help to have test cases to better understand the behaviour and retain it through future changes.

AaryanTR commented 3 months ago

Hi @ansukla, thank you for your response. I have run the tests and they pass. I have also added test cases for the to_text and to_html methods in the latest commit.

Test before adding the new test cases:

Screenshot 2024-06-12 at 11 29 09 PM

Test after adding the new test cases:

Screenshot 2024-06-13 at 12 23 08 AM

AaryanTR commented 3 months ago

> @rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.

I would love to be a maintainer, but this is my work account. Is it possible for you to add my personal GitHub account @aaryan200 as a maintainer?

Thank you