nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.15k stars, 113 forks

Skips first few lines in PDF. #57

Open samgriek opened 4 months ago

samgriek commented 4 months ago

I'm running the server in docker:

image: ghcr.io/nlmatics/nlm-ingestor:latest

I've only tested with one 300-page PDF, and it seems to skip the first couple of lines of the PDF. It doesn't seem to be an issue, but it makes me wonder if anything else is being skipped. The result is the same whether I convert to text, use sections, or convert to HTML.

What might be the cause?

dandawg commented 4 months ago

I'm also seeing this. When using llmsherpa.readers.LayoutPDFReader with the read_pdf method, the returned output is missing the title line of my PDF, which happens to be one of the first lines.
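A minimal sketch of how the missing lines could be checked across the text, HTML, and section views mentioned above. The helper names `missing_lines` and `extract_views` are hypothetical, and the default `api_url` is an assumption about where a local nlm-ingestor container is listening; adjust it to your setup. The check is a plain substring match after normalizing whitespace and case, so it is only approximate, especially for the HTML view.

```python
def missing_lines(expected_lines, extracted_text):
    """Return the expected lines that do not appear in the extracted text.

    Whitespace and case are normalized so that differences in line
    wrapping in the parser output do not cause false alarms.
    """
    def normalize(s):
        return " ".join(s.split()).lower()

    haystack = normalize(extracted_text)
    return [line for line in expected_lines if normalize(line) not in haystack]


def extract_views(
    pdf_path,
    # Assumed URL of a locally running nlm-ingestor container; change as needed.
    api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
):
    """Parse a PDF and return its text, HTML, and section-title views."""
    # Imported lazily so missing_lines() can be used without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    doc = LayoutPDFReader(api_url).read_pdf(pdf_path)
    return {
        "text": doc.to_text(),
        "html": doc.to_html(),
        "sections": "\n".join(section.title for section in doc.sections()),
    }
```

To use it, copy the first few lines of the PDF by hand into a list, call `extract_views` on the file, and pass each view to `missing_lines`. If the same lines are reported missing from all three views, the text is probably being dropped during parsing rather than during rendering.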

samgriek commented 3 months ago

At least it's not just me! I would be open to fixing it, but I'm guessing it's related to the NLP model?

aleksvercau commented 3 months ago

Same here!