nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
922 stars, 112 forks

nlm-ingestor is SUPER SLOW #39

Open pashpashpash opened 3 months ago

pashpashpash commented 3 months ago

As mentioned here: https://github.com/nlmatics/nlm-ingestor/issues/37

Chunking even small PDFs (<20 pages) takes longer than 30 seconds! This is a huge problem in any production environment. Why is this happening?

ansukla commented 3 months ago

Are you using OCR?

pashpashpash commented 3 months ago

> Are you using OCR?

Nope. OCR is off. It still takes 30+ seconds for basic PDFs.

brett-matson commented 1 month ago

Same here. I'm getting 49.8 seconds for this 78 page PDF, no OCR: https://www.hireexpress.com.au/files/operation_manuals/200840_O.pdf

Anyone found a way to speed it up?

Edit: It's closer to 80 seconds for the above PDF
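
To make timings like these comparable between reports, a small harness along these lines might help. It is only a sketch: `time_calls`, `summarize`, and `parse_with_llmsherpa` are hypothetical helper names, and the last one assumes the `llmsherpa` client is installed and that an nlm-ingestor server is reachable at whatever URL you pass in. It does not diagnose why parsing is slow; it only measures wall-clock time per call.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float], pages: int) -> Dict[str, float]:
    """Reduce raw timings to the median and the median seconds per page."""
    median = statistics.median(durations)
    return {"median_s": median, "s_per_page": median / pages}


def parse_with_llmsherpa(pdf_path_or_url: str, api_url: str) -> object:
    """Parse one PDF through the ingestor (assumes llmsherpa is installed)."""
    # Imported lazily so the timing helpers above work without llmsherpa.
    from llmsherpa.readers import LayoutPDFReader

    return LayoutPDFReader(api_url).read_pdf(pdf_path_or_url)


# Example use (assumption: the server runs locally; adjust the URL and
# page count to your own deployment and document):
#   api = "http://localhost:5010/api/parseDocument?renderFormat=all"
#   runs = time_calls(lambda: parse_with_llmsherpa("manual.pdf", api))
#   print(summarize(runs, pages=78))
```

For reference, 49.8 seconds over 78 pages works out to roughly 0.64 seconds per page, and 80 seconds to roughly 1.03 seconds per page.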

pashpashpash commented 4 weeks ago

@ansukla sorry to bother you, but is there any chance we can get an update on this, or have it prioritized? It's severely impacting production performance.