nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
972 stars 124 forks

ZeroDivisionError: float division by zero #20

Open opiethehokie opened 5 months ago

opiethehokie commented 5 months ago

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 15:51:32] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 15:51:32] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: c8fc5a1d-e188-4c12-9b17-8367b29a5fb0.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmpnl6e2o_d.pdf with name c8fc5a1d-e188-4c12-9b17-8367b29a5fb0.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 760.0334ms on workspace
processing page: 0 Number of p_tags.... 2
processing page: 1 Number of p_tags.... 112
processing page: 2 Number of p_tags.... 116
processing page: 3 Number of p_tags.... 120
processing page: 4 Number of p_tags.... 93
processing page: 5 Number of p_tags.... 91
processing page: 6 Number of p_tags.... 106
processing page: 7 Number of p_tags.... 107
processing page: 8 Number of p_tags.... 110
processing page: 9 Number of p_tags.... 95
processing page: 10 Number of p_tags.... 113
processing page: 11 Number of p_tags.... 106
G, GWI, -> Portfolios~ mismatch 2 4
processing page: 12 Number of p_tags.... 50
processing page: 13 Number of p_tags.... 2
processing page: 14 Number of p_tags.... 107
processing page: 15 Number of p_tags.... 216
processing page: 16 Number of p_tags.... 107
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 7
processing blocks in page: 8
processing blocks in page: 8
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/main.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 2676, in organize_and_indent_blocks
    block_idx, footer_count = self.build_table(block_idx, organized_blocks, table_start_idx,
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3066, in build_table
    block_idx, table_end_idx = self.make_table_with_footers(block_idx, footer_count, footers,
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3232, in make_table_with_footers
    and (prev_box[1]/left > 1.1)  # or is_aligned)
ZeroDivisionError: float division by zero

Sorry, I'm unable to share the file. Clamping left to a minimum of 1 in calculate_block_bounds() (visual_ingestor.py) has let me get past this, but it's not clear to me whether that's the best fix.
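
For anyone comparing options, here is a minimal sketch of two ways to avoid the division by zero in a check shaped like `prev_box[1]/left > 1.1`. It is not the project's code: `clamp_left` and `ratio_exceeds` are hypothetical helper names, the sample coordinates are made up, and the sketch assumes `left` is a non-negative page coordinate.

```python
def clamp_left(left: float, minimum: float = 1.0) -> float:
    """Workaround described above: never let `left` fall below `minimum`."""
    return max(left, minimum)


def ratio_exceeds(numerator: float, denominator: float, threshold: float = 1.1) -> bool:
    """Division-free form of `numerator / denominator > threshold`.

    Equivalent to the division only when denominator > 0. For a
    denominator of exactly 0 it returns True when numerator > 0
    instead of raising ZeroDivisionError.
    Assumption: denominator is never negative (page coordinates).
    """
    return numerator > threshold * denominator


if __name__ == "__main__":
    prev_left = 72.0  # hypothetical stand-in for prev_box[1]
    for left in (0.0, 36.0, 72.0):
        clamped = prev_left / clamp_left(left) > 1.1
        guarded = ratio_exceeds(prev_left, left)
        print(f"left={left}: clamped={clamped}, guarded={guarded}")
```

The two approaches agree whenever `left` is at least 1. They can differ for values between 0 and 1, where clamping changes the ratio and the multiplication form does not, so which one is appropriate depends on whether a `left` of 0 is a legitimate coordinate or a sign of a bad bounding box upstream.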

SirAbsolute0 commented 2 days ago

Same problem