gabfeudo opened this issue 6 months ago
This is most likely happening due to some error on the Tika side. Is there OCR content in your PDF? If yes, then pass applyOcr=yes in the arguments. If not, you will need to run the Tika server by itself and then use this notebook to debug: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb. Most errors in Tika happen due to issues with character encodings, and you will need to go through the Java code in https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm.
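For reference, a minimal sketch of how the OCR flag can be appended to the parser endpoint. The helper name and defaults here are my own; only the renderFormat=all and applyOcr=yes parameters come from this thread:

```python
from urllib.parse import urlencode

def parser_url(base="http://nlm-ingestor:5001/api/parseDocument", apply_ocr=False):
    """Build the nlm-ingestor endpoint URL; applyOcr=yes enables OCR on image-only PDFs.

    The base host/port are assumptions -- adjust to your deployment.
    """
    params = {"renderFormat": "all"}
    if apply_ocr:
        # Per this thread, applyOcr=yes tells the ingestor to OCR the PDF.
        params["applyOcr"] = "yes"
    return base + "?" + urlencode(params)
```

The resulting URL can be passed to LLMSherpa's `LayoutPDFReader` as the parser API URL.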
To build Tika, use:
```
apache-maven-3.9.5/bin/mvn package -Dmaven.test.skip=true -Dcheckstyle.skip
```
After some tests I figured out that the problem is caused by particular fonts used inside the PDF files. What approach could I follow to bypass or solve this?
Edit 1: tried with applyOcr=yes, but got the same result. I was just testing, though; the file is exported from Figma and has no OCR content. Everything is on the text layer.
Edit 2: for reference, the fonts used inside one of the PDFs are Onest and Inter.
You will need to set up the Java project for nlm-tika and modify this file: https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
I'd go to line 393 and see which particular character causes this issue, then find some way to handle that character in sanitizeText. After this, you can rebuild and test the jar file generated under tika-server:
```
/Library/Java/JavaVirtualMachines/jdk-17.jdk/Contents/Home/bin/java -jar tika-server/tika-server-standard/target/tika-server-standard-2.4.1.jar
```
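For readers who want a feel for the kind of filtering involved before diving into the Java: character-encoding failures in Tika's XHTML output usually trace back to code points that XML 1.0 cannot represent. Below is a rough Python sketch of that filtering idea; it is an illustration only, not the actual sanitizeText implementation in PDF2XHTML.java, and the function name is mine:

```python
import re

# Code points allowed by XML 1.0: TAB, LF, CR, then the BMP minus
# surrogates and the two non-characters, plus the supplementary planes.
_XML_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def sanitize_text(text: str) -> str:
    """Replace XML-invalid characters with U+FFFD so XHTML output stays well-formed."""
    return _XML_INVALID.sub("\ufffd", text)
```

A fix in the Java sanitizeText would do the analogous check on each extracted character before it is written to the XHTML stream.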
Hope this helps; this takes a while to debug. It is rare, but it does happen every once in a while.
@ansukla thank you so much! Definitely a tedious job, but it will be useful.
I just noticed that the bug is very frequent. I was having problems even with the Azure-hosted version. I don't know what changed over time, or if anything did, but I am getting very frequent parsing errors.
@ansukla I did more tests. I even tried re-embedding the fonts with Ghostscript:
```
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -sOutputFile=output.pdf -f input.pdf
```
but it doesn't work either. I wasn't able to debug tika-parser, but I ran all the tests that came to my mind. I hope this information can help you fix the problem.
My conclusion is that the problem is not with the characters being processed, as you mentioned, but rather that Apache Tika is not able to read the PDF structure correctly.
Let me know if I can help
Edit: I found that the working PDF file, opened with a text editor, contains a MediaBox entry, while the Figma-exported file doesn't. So the Python error could be related to this: maybe the parser ends up doing divisions with NaN page dimensions.
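The observation above can be checked quickly without opening each PDF in an editor. The sketch below is a crude heuristic of my own, not part of nlm-ingestor: it just scans the raw bytes for a /MediaBox entry (which may also be inherited from a parent Pages node, so a single occurrence anywhere in the file is usually enough):

```python
def has_media_box(pdf_path: str) -> bool:
    """Heuristic check: does the raw PDF contain a /MediaBox entry anywhere?

    A PDF whose pages define (or inherit) no MediaBox is the kind of file
    that could leave a parser computing with undefined page dimensions.
    """
    with open(pdf_path, "rb") as f:
        return b"/MediaBox" in f.read()
```

For anything more rigorous (e.g. checking inheritance per page), a real PDF library such as pypdf would be the better tool.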
I'm getting the following error when parsing some PDFs, but not others. Unfortunately I cannot share the files, but I can share some metadata upon request.
Endpoint: http://nlm-ingestor:5001/api/parseDocument?renderFormat=all, called through the LLMSherpa library.
Any suggestions?