nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Bug in API function: Incorrect behavior with repeated sections. #49

Open gshreya5 opened 5 months ago

gshreya5 commented 5 months ago

The issue arises when extracting HTML content from a document using the .to_html() method after reading a PDF with

doc = pdf_reader.read_pdf(pdf_url)
doc.to_html(include_children=True, recurse=True)

When iterating through the sections, the loop processes both the parent sections and their child sections. Because each parent already renders its children recursively, the child content is emitted more than once, which duplicates content in the HTML output.

Here is the relevant code:

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str

ansukla commented 5 months ago

This should not happen, since we only iterate over the first level of sections, where each section is distinct, and then traverse each of those sections all the way to the end. Can you give an example?
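The two readings above can be compared on a toy tree. This is a minimal sketch, not llmsherpa code: `Section`, `all_sections`, and `render` are hypothetical stand-ins, and it assumes (the thread does not confirm this) that the iterated list may contain nested sections as well as top-level ones. With that assumption, the loop quoted earlier emits each child once through its parent and once on its own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Hypothetical stand-in for a section node; not the llmsherpa class."""
    title: str
    children: List["Section"] = field(default_factory=list)

    def to_html(self, include_children: bool = False, recurse: bool = False) -> str:
        html = f"<h2>{self.title}</h2>"
        if include_children:
            for child in self.children:
                # Descend further only when recurse is set.
                html += child.to_html(include_children=recurse, recurse=recurse)
        return html


def all_sections(roots: List[Section]) -> List[Section]:
    """Flatten the tree: every section at every depth, parents first."""
    flat: List[Section] = []
    for node in roots:
        flat.append(node)
        flat.extend(all_sections(node.children))
    return flat


def render(sections: List[Section]) -> str:
    """Same loop shape as the to_html() quoted in the report."""
    html_str = "<html>"
    for section in sections:
        html_str += section.to_html(include_children=True, recurse=True)
    return html_str + "</html>"


top_level = [Section("1", [Section("1.1"), Section("1.2")]), Section("2")]

# Iterating a flattened list visits parents and children, so children repeat.
flattened_html = render(all_sections(top_level))
# Iterating only first-level sections renders each node exactly once.
top_level_html = render(top_level)

print(flattened_html.count("<h2>1.1</h2>"))
print(top_level_html.count("<h2>1.1</h2>"))
```

If the assumption holds for the real library, restricting the loop to first-level sections would be one way to avoid the repetition; whether that matches the actual implementation needs to be checked against the llmsherpa source.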

gshreya5 commented 5 months ago

# HTML comes from IPython; this assumes a Jupyter/IPython session
from IPython.display import HTML
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = 'https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes'
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf')
# Render the extracted document as HTML in the notebook
HTML(doc.to_html())

In the output, sections repeat.

gshreya5 commented 5 months ago

I started my own server following the instructions, but when running doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf') I received the following error: 347, in render_json table_block["left"], KeyError: 'left'

Any insights on how to address this would be greatly appreciated.

gshreya5 commented 5 months ago

@ansukla Hi, is there any update? Thanks.

PrashantK-DS commented 4 months ago

I'm facing the same issue with repeated sections. I had to post-process the output and truncate the HTML to avoid the repetition, but that approach is not very efficient. It would be better to get an exact HTML extraction with no repetition directly from llmsherpa, to avoid unnecessary problems in production.

jpbalarini commented 2 months ago

The same is happening to me. Both to_text and to_html repeat sections in the output.

thomastiotto commented 2 months ago

I'm facing the same issue with Document.to_text(). I posted my findings and solution in #73.

yannickgiguere commented 1 month ago

> I started my own server given the instructions but when reading doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf') I received the following error: 347, in render_json table_block["left"], KeyError: 'left'
>
> Any insights on how to address this would be greatly appreciated.

This seems to be the same issue as reported in https://github.com/nlmatics/nlm-ingestor/issues/18