nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

numpy error while parsing #9

Open gabfeudo opened 6 months ago

gabfeudo commented 6 months ago

I'm getting the following error when parsing some PDFs, but not with others. Unfortunately I cannot share the files, but I can share some metadata upon request.

nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
nlm-ingestor  |   return _methods._mean(a, axis=axis, dtype=dtype,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor  |   ret = ret.dtype.type(ret / rcount)
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
nlm-ingestor  |   ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
nlm-ingestor  |   arrmean = um.true_divide(arrmean, div, out=arrmean,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor  |   ret = ret.dtype.type(ret / rcount)

Endpoint: http://nlm-ingestor:5001/api/parseDocument?renderFormat=all called through LLMSherpa library

Any suggestion?
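
For readers trying to interpret the log: these are the warnings NumPy emits when a mean or standard deviation is taken over an empty array, and the result in that case is NaN. Which call inside nlm-ingestor triggers them is not established in this thread. The sketch below only reproduces the warnings in isolation and shows one possible guard; `summarize` is a hypothetical helper, not a function from the project.

```python
import warnings

import numpy as np


def summarize(values):
    """Return (mean, std) of values, or (None, None) when there is nothing to average."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        # Guard: skip the statistics instead of letting NumPy return NaN.
        return None, None
    return float(arr.mean()), float(arr.std())


# Reproduce the warnings from the log by averaging an empty array.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_mean = np.mean(np.array([]))
    empty_std = np.std(np.array([]))

messages = [str(w.message) for w in caught]
for message in messages:
    print(message)
```

If the parser ends up with no usable values for a page (for example, no text boxes with valid coordinates), statistics like these would come out as NaN, which would match the log. That is an assumption to verify, not a diagnosis.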

ansukla commented 6 months ago

This is most likely caused by an error on the Tika side. Is there OCR content in your PDF? If yes, pass applyOCR=yes in the arguments. If not, you will need to run the Tika server by itself and then debug with this notebook: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb. Most errors in Tika come from character-encoding issues, and tracking them down means going through the Java code in https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm.
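
A minimal sketch of adding the OCR argument to the endpoint URL from the original report. `with_query` is a hypothetical helper. The parameter appears as both `applyOCR` and `applyOcr` in this thread, so the spelling used here is an assumption to check against the server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, **params):
    """Return url with the given query parameters added or overridden."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Endpoint taken from the report above.
base = "http://nlm-ingestor:5001/api/parseDocument?renderFormat=all"

# Parameter spelling is an assumption; the thread uses two variants.
ocr_url = with_query(base, applyOcr="yes")
print(ocr_url)
```

The resulting URL can be passed to the client in place of the original one, for example as the API URL given to llmsherpa's `LayoutPDFReader`.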


To build tika use:
```
apache-maven-3.9.5/bin/mvn package -Dmaven.test.skip=true -Dcheckstyle.skip
```
gabfeudo commented 6 months ago

After some tests, I figured out that the problem is caused by particular fonts used in the PDF files. What approach could I take to work around or solve this?

Edit 1: I tried applyOcr=yes but got the same result. That was just a test, though: the file is a Figma export and has no OCR text. Everything is on the text layer.

Edit 2: for reference, the fonts used in one of the PDFs are Onest and Inter.

ansukla commented 6 months ago

You will need to set up the Java project for nlm-tika and modify this file: https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java

I'd go to line 393 and see what particular character results in this issue. Then find some way to handle that character in sanitizeText. After this you can rebuild and test the jar file generated under tika-server:

/Library/Java/JavaVirtualMachines/jdk-17.jdk/Contents/Home/bin/java -jar tika-server/tika-server-standard/target/tika-server-standard-2.4.1.jar

Hope this helps, this takes a while to debug. It is rare but does happen every once in a while.
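
Before stepping through the Java code, it may help to narrow down which characters are unusual in the text Tika extracts. The sketch below is one hedged way to do that from Python. `suspicious_chars` and the choice of Unicode categories are assumptions for illustration, not part of nlm-tika or its `sanitizeText` method.

```python
import unicodedata
from collections import Counter

# Unicode general categories that rarely belong in body text:
# Cc = control, Cf = format, Co = private use, Cs = surrogate, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cs", "Cn"}

# Ordinary whitespace control characters that should not be flagged.
ALLOWED = {"\n", "\r", "\t"}


def suspicious_chars(text):
    """Count characters in text whose Unicode category is in SUSPECT_CATEGORIES."""
    counts = Counter(
        ch
        for ch in text
        if ch not in ALLOWED and unicodedata.category(ch) in SUSPECT_CATEGORIES
    )
    # Report code points rather than raw characters so they are readable in logs.
    return {f"U+{ord(ch):04X}": n for ch, n in counts.items()}


# Made-up sample containing a private-use character and a NUL byte.
sample = "Pric\ue001ing\x00 plan\n"
print(suspicious_chars(sample))
```

Running this over the text extracted from a failing PDF and from a working one gives a short list of code points to look for around the line mentioned above.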

gabfeudo commented 6 months ago

@ansukla thank you so much! Definitely a boring job but it will be useful

gabfeudo commented 6 months ago

> Hope this helps, this takes a while to debug. It is rare but does happen every once in a while.

I just noticed that the bug is very frequent. I was having the problem even with the Azure-hosted version. I don't know whether something changed over time, but I get parsing errors very frequently.

gabfeudo commented 6 months ago

@ansukla I did more tests and what I discovered is that

I wasn't able to debug tika-parser, but I ran every test that came to mind. I hope this information helps you fix the problem.

My conclusion is that the problem is not in the characters being processed, as you suggested, but rather that Apache Tika is not able to read the PDF structure correctly.

Let me know if I can help

Edit: I found that the working PDF file, opened in a text editor, contains MediaBox metadata, while the Figma-exported file doesn't. So the Python error could be related to this, because it may be trying to divide with NaN values.
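
The text-editor check can be automated with a rough sketch like the one below. `find_mediaboxes` is a hypothetical helper that scans only the uncompressed bytes of the file. A PDF can store its page dictionaries inside compressed object streams, and a page can inherit MediaBox from a parent node, so an empty result does not prove the entry is missing.

```python
import re

# Matches a literal MediaBox array such as "/MediaBox [0 0 612 792]".
MEDIABOX = re.compile(rb"/MediaBox\s*\[([^\]]*)\]")


def find_mediaboxes(pdf_bytes):
    """Return MediaBox arrays found in the uncompressed parts of a PDF as lists of floats."""
    boxes = []
    for match in MEDIABOX.finditer(pdf_bytes):
        try:
            boxes.append([float(tok.decode("ascii")) for tok in match.group(1).split()])
        except ValueError:
            # Indirect references or unusual syntax: skip rather than guess.
            continue
    return boxes


# Made-up fragment of a page dictionary for illustration.
sample = b"<< /Type /Page /MediaBox [0 0 612 792] >>"
print(find_mediaboxes(sample))
```

On a real file, read it with `open(path, "rb").read()` and pass the bytes in. If the working PDF yields boxes and the Figma export yields none, a PDF library that decodes object streams would be needed to confirm whether the page size is absent or just compressed.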