run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
3.19k stars 312 forks

LlamaParse returns incomprehensible output #412

Closed kaitpw closed 1 month ago

kaitpw commented 1 month ago

Describe the bug Sending a Tesla Powerwall spec sheet to LlamaParse (LP) produces output that looks like encrypted text. The PDF contains human-readable tabular information, and LP seems to pick up on some sort of column/row structure, but the text itself is gibberish. A small snippet of the output is below.

LWwhltlBnmshmtntrBtqqdmsNee+Fqhc 3/=ntsots
InWcRsWqsBWoWahkhsx 87,007=KQ=1
OULWwhltlHmotsUnksWfd 5//UCB 0485ll
OUCBHmotsUnksWfdPWmfd 5/,44/UCB
OUCBLOOSUnksWfdPWmfd 5/,37/UCB

When the file is converted to a PNG, LP does fine. I'm guessing it's some sort of anti-crawling thing implemented by Tesla, but I'm reporting it as a bug in case it isn't. If it's not a bug, it'd be cool if LP dealt with this problem on its own; it's an easy thing for me to implement myself, though.

Files powerwall-datasheet-(not-working).pdf powerwall-datasheet-(working).png

Job ID For the pdf: 0402600d-f4e3-4398-88ff-fb1a0d99f6a2 For the png: 6cade2b5-7849-42dc-9ea2-959701e1bac7
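
A minimal sketch of the client-side workaround described above: check whether the text parsed from the PDF looks garbled and, if so, parse a rasterized copy instead. Everything here is an assumption for illustration, not part of LlamaParse: the vowel-ratio heuristic, the 0.25 threshold, and the `parse_pdf` / `parse_as_image` callables, which stand in for whatever code submits the PDF or a PNG rendering of it. The heuristic is crude and may misfire on text that is mostly units or abbreviations.

```python
from typing import Callable

VOWELS = set("aeiouAEIOU")


def looks_garbled(text: str, min_vowel_ratio: float = 0.25) -> bool:
    """Crude check: ordinary English has far more vowels than the sample above."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # nothing to judge (e.g. purely numeric text)
    vowel_count = sum(ch in VOWELS for ch in letters)
    return vowel_count / len(letters) < min_vowel_ratio


def parse_with_fallback(
    path: str,
    parse_pdf: Callable[[str], str],       # hypothetical: parses the PDF directly
    parse_as_image: Callable[[str], str],  # hypothetical: rasterizes, then parses
) -> str:
    """Return the direct parse unless it looks garbled, else the image-based parse."""
    text = parse_pdf(path)
    if looks_garbled(text):
        return parse_as_image(path)
    return text
```

With the first line of the snippet above, `looks_garbled("LWwhltlBnmshmtntrBtqqdmsNee+Fqhc 3/=ntsots")` is true (3 vowels out of 37 letters), while a readable heading such as "Maximum Continuous Current Off-Grid" passes.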

hexapode commented 1 month ago

Thank you for reporting. The font in the document is 'encrypted'; try copying and pasting text from it. We will do a patch to decipher fonts encoded this particular way and let you know. It should land in prod by EOW.
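
As a toy illustration of what an 'encrypted' font can look like to a text extractor: in the snippet from the report, many letters appear to sit one code point below the letter that was presumably intended. This is only an observation about that sample, not a description of the font's actual encoding or of the fix; the `shift_letters` helper is hypothetical.

```python
def shift_letters(text: str, offset: int = 1) -> str:
    """Shift each letter by `offset` code points; other characters are left alone.

    No wrap-around is applied, so 'z' and 'Z' would map to non-letters.
    """
    return "".join(chr(ord(ch) + offset) if ch.isalpha() else ch for ch in text)


print(shift_letters("Bnmshmtntr"))  # Continuous
print(shift_letters("Btqqdms"))     # Current
print(shift_letters("UnksWfd"))     # VoltXge
```

The last result shows that the pattern is not a uniform shift: "UnksWfd" comes out as "VoltXge" rather than "Voltage", so at least some characters are mapped differently.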

hexapode commented 1 month ago

Hi! There was an issue with our handling of Type1C fonts; this is now fixed in production. Your document should work.

kaitpw commented 1 month ago

Sooo sick, thank you! I rly appreciate the quick turnaround.