nlmatics / nlm-ingestor

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
972 stars, 124 forks

Is it recommended to use the new indent parser? #2

Open jpbalarini opened 6 months ago

jpbalarini commented 6 months ago

Hi! I'm looking to use nlm-ingestor + llmsherpa to ingest PDFs. I saw that there is an option to use a different algorithm via the useNewIndentParser flag. How does it differ from the old parser? Is it recommended for use in a production app, or is it still experimental or a WIP?

Thanks!

ansukla commented 6 months ago

This depends on the type of document you have. If it is a well-structured legal or financial document, the new indent parser may give you a more consistent structure. If it is a PowerPoint file converted to PDF with an inconsistent header structure, neither parser will be fully accurate. The best approach is to try both indenting schemes and see which one produces better results for your documents.
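
To try both schemes side by side, one option is to build two parser URLs that differ only in the useNewIndentParser query parameter. The following is a minimal sketch, not an official recipe: the host, port, and `renderFormat=all` parameter assume a locally running nlm-ingestor server, and `parser_url` is a hypothetical helper introduced here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: nlm-ingestor is running locally; adjust host and port to your deployment.
BASE_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"


def parser_url(base_url: str, use_new_indent_parser: bool) -> str:
    """Return base_url with the useNewIndentParser flag set or removed.

    Hypothetical helper; it only edits the query string.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    if use_new_indent_parser:
        query["useNewIndentParser"] = "true"
    else:
        query.pop("useNewIndentParser", None)
    return urlunsplit(parts._replace(query=urlencode(query)))


old_url = parser_url(BASE_URL, use_new_indent_parser=False)
new_url = parser_url(BASE_URL, use_new_indent_parser=True)

print(old_url)
print(new_url)
```

Each URL can then be passed to llmsherpa's `LayoutPDFReader`, and the same PDF read through both readers, so that the resulting section hierarchies can be compared on a few representative documents before choosing one.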