This repo provides the service code that the llmsherpa API connects to. It contains custom RAG (retrieval-augmented generation) friendly parsers for the following file formats:
The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data from the nlmatics-modified version of Tika, found here: https://github.com/nlmatics/nlm-tika. The PDF parser works off the text layer and also offers an OCR option (apply_ocr) to automatically use OCR if there are scanned pages in your PDFs. The OCR feature is based on the nlmatics-modified version of Tika, which uses Tesseract underneath. Check out the notebook pdf_visual_ingestor_step_by_step to experiment directly with the PDF parser.
The PDF Parser offers the following features:
A special HTML parser that creates layout-aware blocks to improve RAG performance through higher-quality chunks.
A special text parser that tries to identify lists, tables, headers, etc. purely by looking at the text, without any visual, font, or bbox information.
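To give a feel for what "purely by looking at the text" can mean, here is a toy sketch of text-only line classification. It is not the nlm-ingestor implementation: the patterns, the thresholds, and the name `classify_line` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; NOT taken from nlm-ingestor.
# A list item starts with a bullet character, "1." / "1)", or "a." / "a)".
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)]|[a-zA-Z][.)])\s+\S")
# A table row has at least three cells separated by tabs or runs of 2+ spaces.
TABLE_ROW = re.compile(r"\S(?: {2,}|\t)\S.*(?: {2,}|\t)\S")


def classify_line(line: str) -> str:
    """Guess the role of a single line using only its text."""
    text = line.strip()
    if not text:
        return "blank"
    if LIST_ITEM.match(line):
        return "list_item"
    if TABLE_ROW.search(text):
        return "table_row"
    # Assumed heuristic: short, no trailing punctuation, title or upper case.
    is_short = len(text.split()) <= 8
    no_trailing_punct = not text.endswith((".", ",", ";", ":"))
    if is_short and no_trailing_punct and (text.isupper() or text.istitle()):
        return "header"
    return "paragraph"


if __name__ == "__main__":
    sample = [
        "Installation Steps",
        "- pull the image",
        "Name    Qty    Price",
        "The parser reads text and guesses structure.",
    ]
    for line in sample:
        print(f"{classify_line(line):10s} | {line}")
```

A real parser needs far more context than single lines (neighbouring lines, indentation, numbering sequences), so treat this only as a starting point for experiments.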
There are two ways to process these types of documents
java -jar <path_to_nlm_ingestor>/jars/tika-server-standard-nlm-modified-2.9.2_v2.jar
!pip install nlm-ingestor
python -m nlm_ingestor.ingestion_daemon
A Docker image is available via the public GitHub Container Registry.
Pull the Docker image:
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
Run the Docker image, mapping container port 5001 to a host port of your choice (5010 in this example):
docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest-<version>
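After starting the container, it can be handy to wait until the mapped port accepts connections before sending documents. The following is a minimal sketch using only the Python standard library. The helper name `wait_for_port` and its defaults are assumptions, with 5010 taken from the port mapping in the command above. An open port only shows that something is listening, not that the service has finished initialising.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 5010,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before the timeout expires.
    Hypothetical helper; 5010 is the host port from the docker run example.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is not reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port():
        print("Port 5010 is accepting connections.")
    else:
        print("Nothing is listening on port 5010 yet.")
```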
Once the server is running, you can use the llmsherpa API library to get chunks and use them in your LLM projects. With the port mapping above, your llmsherpa_url will be: "http://localhost:5010/api/parseDocument?renderFormat=all"
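If you mapped the container to a different host port, the URL changes accordingly. As a small sketch, the helper below assembles the URL from its parts. The function name `build_llmsherpa_url` and its parameters are hypothetical; only the path `/api/parseDocument` and the `renderFormat=all` query parameter come from the text above.

```python
from typing import Optional
from urllib.parse import urlencode, urlunsplit


def build_llmsherpa_url(host: str = "localhost", port: int = 5010,
                        render_format: str = "all",
                        extra_params: Optional[dict] = None) -> str:
    """Build the parseDocument URL for a locally running nlm-ingestor.

    Hypothetical helper. Any extra query parameters are passed through
    unchanged and are not validated against what the server supports.
    """
    query = {"renderFormat": render_format}
    if extra_params:
        query.update(extra_params)
    return urlunsplit(
        ("http", f"{host}:{port}", "/api/parseDocument", urlencode(query), "")
    )


if __name__ == "__main__":
    llmsherpa_url = build_llmsherpa_url()
    print(llmsherpa_url)
    # With the llmsherpa package installed, the URL would typically be used as:
    #   from llmsherpa.readers import LayoutPDFReader
    #   doc = LayoutPDFReader(llmsherpa_url).read_pdf("path/to/file.pdf")
    #   chunks = doc.chunks()
```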
Sample test code to test the server with llmsherpa parser is in this notebook.
Over the course of 4 years, the nlmatics team evaluated a variety of options, including a YOLO-based vision parser developed by Tom Liu and Yi Zhang. Ultimately, we settled on the rule-based parser for the following reasons.
The PDF parser (visual_ingestor and new_indent_parser) was written by Ambika Sukla, with additional contributions from Reshav Abraham, who wrote the initial code to modify Tika; Tom Liu, who wrote the original Indent Parser; and Kiran Panicker, who made several improvements to parsing speed, table parsing accuracy, indent parsing accuracy, and reordering accuracy.
The HTML Ingestor was written by Tom Liu.
The Markdown Parser was written by Yi Zhang.
The Text Ingestor was written by Reshav Abraham.
The XML Ingestor was written by Ambika Sukla primarily to process PubMed XMLs.
The line_parser, which serves as a core sentence processing utility for all the other parsers, was written by Ambika Sukla.
We are also thankful to the Apache PDFBox and Tika developer community for their years of work in providing the base for the PDF parser.
The nlmatics-modified version of Tika can be found in the 2.4.1-nlm branch here: https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm. For convenience, a compiled jar file of the code is included in this repo in the jars/ folder. In some cases, your PDFs may cause errors in the Java server, and you will need to modify the code there to resolve the issue and recompile the jar file.
The following files are changed:
1) https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
2) https://github.com/nlmatics/nlm-tika/blob/2.4.1-nlm/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
The changes above add font and coordinate information to every text element. They also remove watermarks.
The above is to add lines and rectangles that can potentially help with table detection.
To see the impact of these changes, see the first part of the notebook here: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks/pdf_visual_ingestor_step_by_step.ipynb
Some ideas for future work:
1) Make the changes independent of Tika by writing our own wrapper over PDFBox
2) Upgrade to the latest version of Tika
3) Clean up the format of the returned HTML to make it more CSS-friendly