dkarthicks27 opened 5 months ago
For me, the hosted endpoint https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all works, but the local server running from the Docker image does not.
This is the error on the server:
`Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/main.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
    p["style"], p.text, page_width
    ~^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
    return self.attrs[key]
KeyError: 'style'`
I passed in the sample PDF provided in the example code: https://arxiv.org/pdf/1910.13461.pdf
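The last two frames of the traceback show a subscript lookup (`p["style"]`) ending in a plain dictionary access (`self.attrs[key]`), which raises `KeyError` when the element has no `style` attribute. The sketch below reproduces that pattern with a small stand-in class, so it runs without any dependencies. `FakeTag` and `style_of` are hypothetical names used only for illustration; they are not part of nlm-ingestor or BeautifulSoup. BeautifulSoup's `Tag` does offer a `.get(key, default)` method with the same semantics as the one shown here, but whether an empty default is the right fix inside `visual_ingestor.py` is an assumption, not something confirmed in this thread.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed HTML element (not the bs4 class)."""

    def __init__(self, text, attrs=None):
        self.text = text
        self.attrs = dict(attrs or {})

    def __getitem__(self, key):
        # Same shape as the final traceback frame: a plain dict lookup,
        # so a missing attribute raises KeyError.
        return self.attrs[key]

    def get(self, key, default=None):
        # Tolerant lookup: returns the default instead of raising.
        return self.attrs.get(key, default)


def style_of(tag):
    """Return the element's style string, or "" if it has none."""
    return tag.get("style", "")


styled = FakeTag("Title", {"style": "font-size:12px"})
unstyled = FakeTag("Body text")  # no "style" attribute at all

print(style_of(styled))
print(repr(style_of(unstyled)))

try:
    unstyled["style"]  # the access pattern seen in the traceback
except KeyError as exc:
    print("KeyError:", exc)
```

This only illustrates why the lookup fails; it does not explain why the local Docker image produces elements without a `style` attribute while the hosted endpoint does not.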
I have exactly the same issue with the local server using the docker image parsing the attached PDF:
KeyError: 'style'
Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/main.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
    p["style"], p.text, page_width
    ~^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
    return self.attrs[key]
KeyError: 'style'
[Aceton.pdf](https://github.com/user-attachments/files/16140536/Aceton.pdf)
I am getting an "ingestion failed" response when I hit the endpoint https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all
to fetch chunks from any PDF document.
This is the response:
{'return_dict': {}, 'status': 'ingest_failed'}
I tried printing out the response_json
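On the client side, a response like the one above can be checked explicitly before any chunks are read from it. This is a minimal sketch that assumes only the two keys visible in the printed response (`return_dict` and `status`). The helper name `extract_result`, the `"ok"` status value, and the `"result"` key in the usage lines are hypothetical and are not taken from the llmsherpa client.

```python
def extract_result(response_json):
    """Return the non-empty return_dict, or raise if ingestion failed."""
    status = response_json.get("status")
    return_dict = response_json.get("return_dict") or {}
    if status == "ingest_failed" or not return_dict:
        raise RuntimeError(
            f"parseDocument returned no result (status={status!r})"
        )
    return return_dict


# The failed response reported above.
failed = {"return_dict": {}, "status": "ingest_failed"}
try:
    extract_result(failed)
except RuntimeError as exc:
    print(exc)

# A made-up successful response, for contrast only.
succeeded = {"return_dict": {"result": {"blocks": []}}, "status": "ok"}
print(extract_result(succeeded))
```

Raising early keeps the server-side failure visible to the caller instead of surfacing later as an empty or missing document.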