nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.15k stars 113 forks

pdf problem? #14

Open legaltextai opened 8 months ago

legaltextai commented 8 months ago

I scanned a document, and the PDF lets me copy and paste the text, but I get this error with LayoutPDFReader:


```
     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'
```

Is it because of the format? Is there anything I could do to turn the PDF into a more readable format for your API?

SuadAshammari commented 8 months ago

Thank you for this great work! I have the same issue here. I believe it is caused by the PDF I'm using: once I switch to a different PDF, it works. Are there any updates that would make it work for all files without errors?

Thank you again!
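
For readers hitting the same `KeyError: 'result'`: the traceback suggests the parser response came back without a `result` entry, so the client fails on the lookup instead of reporting what the server said. A minimal sketch of a defensive check follows, assuming the response is JSON shaped like the one in the traceback. `extract_blocks` and `ParseFailed` are hypothetical names, not part of llmsherpa, and the `reason` key is an assumption (a later comment in this thread reports seeing one).

```python
import json


class ParseFailed(Exception):
    """Raised when the parser response carries no blocks."""


def extract_blocks(raw: bytes) -> list:
    """Return the blocks from a parser response, or raise ParseFailed."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        payload = {}

    return_dict = payload.get("return_dict")
    if not isinstance(return_dict, dict):
        return_dict = {}

    result = return_dict.get("result")
    if not isinstance(result, dict) or "blocks" not in result:
        # 'reason' is an assumed field name; fall back to "unknown" if absent.
        reason = return_dict.get("reason") or payload.get("reason") or "unknown"
        raise ParseFailed(f"parser returned no blocks (reason: {reason})")

    return result["blocks"]
```

Raising with whatever reason the server supplied makes it easier to tell an unsupported PDF apart from a client-side problem.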

ansukla commented 8 months ago

Hi - Please update your library to the latest version and try again. If the problem persists, please share the PDF if possible.

legaltextai commented 8 months ago

lsat_10_15.pdf

ansukla commented 8 months ago

Hi @legaltextai,

We do not support OCR at the moment. This PDF is OCR and does not have a text layer.
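
Since PDFs without a text layer are not supported, it may help to check for one before sending a file to the parser. The sketch below is one rough way to do that, not something the project provides: `has_text_layer` and `pdf_page_texts` are hypothetical helpers, the 20-character threshold is an arbitrary assumption, and `pdf_page_texts` assumes the third-party `pypdf` package is installed.

```python
from typing import Iterable, List, Optional


def has_text_layer(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """Return True once the pages hold at least min_chars non-whitespace characters."""
    total = 0
    for text in page_texts:
        total += sum(1 for ch in (text or "") if not ch.isspace())
        if total >= min_chars:
            return True
    return False


def pdf_page_texts(path: str) -> List[str]:
    """Extract the text of each page of a PDF file."""
    # Assumption: pypdf is installed. It is imported lazily so that
    # has_text_layer stays usable without it.
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(path).pages]
```

A call such as `has_text_layer(pdf_page_texts("scan.pdf"))` returning `False` would suggest the file needs OCR first. Passing this check does not guarantee that the parser will accept the file.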

SuadAshammari commented 8 months ago

Roughley - 2020 - Five Years of the KNIME Vernalis Cheminformatics C.pdf

I have this PDF, and it is not OCR.

ansukla commented 8 months ago

@SuadAshammari - This one will take some time to resolve. Will update you when we have a resolution.

lan2720 commented 7 months ago

Any fix here?

SuadAshammari commented 6 months ago

> @SuadAshammari - This one will take some time to resolve. Will update you when we have a resolution.

Any update? @lan2720

Thanks!

pingtv commented 5 months ago

TDS.pdf

I have the same issue here. This is a fillable PDF @ansukla

madhuprakash19 commented 1 day ago

I added a print statement for response_json and it gave reason: style. I also checked the Docker logs; it is failing here with "error uploading file", stacktrace:

Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
    p["style"], p.text, page_width
    ~^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
    return self.attrs[key]
           ~~~~~~~~~~^^^^^
KeyError: 'style'
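
The last frames show `p["style"]` raising inside Beautiful Soup's `Tag.__getitem__`, which is what happens when a tag has no `style` attribute. Beautiful Soup tags also offer a dict-like `.get()` that returns `None` instead of raising. The sketch below illustrates that tolerant lookup; `split_by_style` is a hypothetical helper, not the project's fix, and whether unstyled tags should be skipped or given a default depends on how the parser uses the style.

```python
def split_by_style(tags):
    """Partition tag-like objects into those with a 'style' attribute and those without.

    Works with any object exposing a dict-like .get(), such as a
    bs4 Tag or, for illustration, a plain dict of attributes.
    """
    styled, unstyled = [], []
    for tag in tags:
        style = tag.get("style")  # None when the attribute is missing; no KeyError
        if style:
            styled.append(tag)
        else:
            unstyled.append(tag)
    return styled, unstyled
```

Inspecting the unstyled group for a failing PDF could show which elements the parser was not expecting.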