run-llama / llama_parse

Parse files for optimal RAG

https://www.llamaindex.ai

MIT License

bad ocr #44

Open cognitivetech opened 4 months ago

cognitivetech commented 4 months ago

Questioning_development_review.PDF

I wasn't intentionally testing OCR, but here we are. I won't share an example, but the output is missing spaces and newlines and puts numbers where they don't belong.

When I first run the file through ocrmypdf with the following command: ocrmypdf --clean --output-type pdf --redo-ocr and then re-run it through llama-parse, I get a much better result.
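For anyone who wants to script this workaround, here is a minimal sketch in Python. It assumes ocrmypdf is installed and on the PATH (with unpaper available, which --clean needs), that the llama-parse package is installed, and that LLAMA_CLOUD_API_KEY is set in the environment. The output file name and the helper function names are made up for illustration, and result_type="text" is just one possible setting.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src: Path, dst: Path) -> list[str]:
    """Build the argv for the ocrmypdf call, using the flags quoted above."""
    return [
        "ocrmypdf",
        "--clean",                # clean scanning artifacts before OCR
        "--output-type", "pdf",   # write a plain PDF rather than PDF/A
        "--redo-ocr",             # replace the existing OCR text layer
        str(src),
        str(dst),
    ]


def redo_ocr(src: Path, dst: Path) -> Path:
    """Run ocrmypdf and return the path of the re-OCRed file."""
    # check=True raises CalledProcessError if ocrmypdf exits non-zero.
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)
    return dst


def parse_with_llama_parse(pdf: Path) -> str:
    """Send the PDF to llama-parse and return the concatenated text."""
    # Imported lazily so the ocrmypdf step works without llama-parse installed.
    from llama_parse import LlamaParse

    # The API key is picked up from LLAMA_CLOUD_API_KEY (assumed to be set).
    parser = LlamaParse(result_type="text")
    documents = parser.load_data(str(pdf))
    return "\n".join(doc.text for doc in documents)


if __name__ == "__main__":
    source = Path("Questioning_development_review.PDF")
    cleaned = Path("Questioning_development_review.ocr.pdf")  # hypothetical name
    print(parse_with_llama_parse(redo_ocr(source, cleaned)))
```

Keeping the command builder separate from the subprocess call makes it easy to inspect or tweak the flags before anything is run.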

anoopshrma commented 4 months ago

Thank you for the feedback, @cognitivetech. It'll get reviewed soon!