nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Submit Document Object to self-hosted nlm-ingestor #78

Open BainMcKay opened 1 month ago

BainMcKay commented 1 month ago

Using Python, I am downloading Google Drive files to a local server and caching them in a server tmp folder so that a failed run can restart from a checkpoint. I load each file into a Document Object, which I then parse with semantic parsing. I also want to submit the document object to a local nlm-ingestor server for processing. But if I submit the filename and document object, it fails with a 404. I don't want to create a publicly available downloads folder on the nlm-ingestor server. Is there a way to submit the document object, rather than a URL, to [self.parse_pdf(pdf_file)] in [file_reader.py]?

BainMcKay commented 1 month ago

Found the issue.

  1. The URL in the example does not match the routing rule in the server code. It should be [http://yourserverip/api/parsedocument?renderFormat=all]. The additional folders in the example URL are not part of the server's REST API routing path; the REST route is [api/parsedocument].
  2. The PDF rule parser looks for a style attribute, which did not exist in the Tika text extraction from the CV PDF documents I was using. It looks like there was an attempt to assign a default value if the style attribute was not found, which causes the document to flush with an opaque error [404 NOT FOUND]. I tried adding conditionals based on the style not being found, but the problem threads down through the code. Instead, I added a condition: if the style attribute is not found, report it to the console log and flush the document. The calling client API then switches to another parsing algorithm, which does work.
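For anyone hitting the same 404, here is a minimal sketch of posting in-memory PDF bytes to the corrected route from item 1, using only the Python standard library. The helper names (`build_parse_url`, `encode_pdf_upload`, `submit_pdf_bytes`) are hypothetical, and the multipart field name `file` is an assumption about what the server expects, so check it against your server code before relying on it.

```python
import json
import urllib.request
import uuid
from urllib.parse import urlencode


def build_parse_url(server_base, render_format="all"):
    """Build the parse URL from the route named above: api/parsedocument."""
    query = urlencode({"renderFormat": render_format})
    return f"{server_base.rstrip('/')}/api/parsedocument?{query}"


def encode_pdf_upload(filename, data, field_name="file"):
    """Encode PDF bytes as a multipart/form-data body.

    ASSUMPTION: the server reads the upload from a form field named "file".
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def submit_pdf_bytes(server_base, filename, data):
    """POST the bytes directly, so no public download folder is needed."""
    body, content_type = encode_pdf_upload(filename, data)
    request = urllib.request.Request(
        build_parse_url(server_base),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # ASSUMPTION: the server answers with a JSON document.
        return json.loads(response.read().decode("utf-8"))
```

With the cached file already read into memory, a call would look like `submit_pdf_bytes("http://yourserverip", "cv.pdf", pdf_bytes)`. The filename is only a label in the upload; the server never has to fetch anything by URL.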

BUG: The style parser bug needs to be fixed for the parser to work.
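Until that is fixed, the workaround from item 2 (log the missing style, give up on the document, and let the client fall back to another parser) can be sketched roughly as below. This is not the nlm-ingestor code: `MissingStyleError`, `require_style`, and `parse_with_fallback` are hypothetical names, and extracted elements are modeled as plain dicts purely for illustration.

```python
import logging

logger = logging.getLogger("ingest_client")


class MissingStyleError(ValueError):
    """Hypothetical error: an extracted element has no style attribute."""


def require_style(element):
    """Return the element's style, or fail with a clear, specific error.

    ASSUMPTION: elements are dicts here; the real parser works on its own
    element objects.
    """
    style = element.get("style")
    if style is None:
        raise MissingStyleError(f"element has no style attribute: {element!r}")
    return style


def parse_with_fallback(document, primary, fallback):
    """Try the primary parser; on a missing style, log it and fall back."""
    try:
        return primary(document)
    except MissingStyleError as exc:
        # Report the real cause instead of surfacing an opaque 404.
        logger.warning("primary parser skipped document: %s", exc)
        return fallback(document)
```

The point of the dedicated exception is that the caller can tell "this document has no style information" apart from a genuine routing failure, and only switch parsers in the first case.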

jamesvillarrubia commented 1 month ago

You may need to download the most recent jar file, 2.9.2_v2 (tika-server-standard-nlm-modified-2.9.2_v2.jar), or downgrade to 2.4.1v6. There was a big update to bring nlm-ingestor in line with Apache Tika's most recent releases, but Tika's jars had to be modified as well. That update introduced style parser bugs in 2.9.2_v1 that may be fixed in v2.