nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

How to use ingestors? #12

Open frankiedrake opened 5 months ago

frankiedrake commented 5 months ago

The documentation provides an example of using the LayoutPDFReader class to process PDF documents. It also mentions various ingestors (XML, HTML, text, etc.), but there is not a single example of how to use them or how they are connected to LayoutPDFReader. Maybe there's a LayoutTextReader or something similar?

ansukla commented 5 months ago

Hi - you will need to run your own server to get these capabilities. See instructions here: https://github.com/nlmatics/nlm-ingestor/pkgs/container/nlm-ingestor. The name LayoutPDFReader is a bit of a misnomer: you can pass in different kinds of documents and it will work the same way.

kzecchini commented 5 months ago

@ansukla I tried to use this to parse XML documents, but can't get it to work properly.

I tried hosting the server locally with the Docker image and passing XML documents through LayoutPDFReader, but the documents were not parsed properly.

I also tried modifying the code to change the MIME type of the POST request to application/xml and then to text/xml, sending it to the same endpoint http://localhost:5010/api/parseDocument?renderFormat=all, but that didn't work either.

Any examples of how to use this service to chunk XML documents would be great - thanks!
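
For reference, the request described above can be sketched with the Python standard library alone. This is a minimal sketch, not a confirmed fix: the endpoint URL is the one quoted in this thread, but the multipart form field name `file`, the helper names `build_parse_request` and `parse_document`, and the assumption that the server returns JSON are illustrative guesses. It does not establish that the server parses XML correctly, which is the open question here.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

# Endpoint quoted in this thread; assumes a locally hosted nlm-ingestor server.
API_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"


def build_parse_request(filename, data, api_url=API_URL, content_type=None):
    """Build a multipart/form-data POST carrying one uploaded file.

    The form field name "file" is an assumption, not confirmed in this thread.
    """
    if content_type is None:
        # Guess the MIME type from the extension; fall back to a generic type.
        content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=head + data + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_document(path, api_url=API_URL, content_type=None):
    """Send a local file to the server and return the decoded JSON response."""
    with open(path, "rb") as fh:
        request = build_parse_request(
            os.path.basename(path), fh.read(), api_url, content_type
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `parse_document("example.xml", content_type="application/xml")` (with a hypothetical local file) sends the upload with an explicit MIME type, which makes it easy to compare how the server responds to application/xml versus text/xml.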