run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
2.44k stars, 248 forks

bad parsing results #354

Open yassello55 opened 4 weeks ago

yassello55 commented 4 weeks ago

I had a bad parsing experience with a PDF file (the Uber annual report). The results I got in the Markdown output are very poor. I have tried many other PDF files and the results were OK, so I think the issue is related to the way this PDF was built.

e23076_uber-ars.pdf

MD results, for example:

Uber 2022 Annual Report

Uber�s Mission

We reimagine the way the world moves for the better

Our Values

Do the right thing Go get it Trip obsessed Build with heart
Period. Bring the mindset of a champion. Make magic in the marketplace. We care.
Stand for safety See the forest and the trees One Uber Great minds don't think alike
Safety never stops. Know the details that matter. Bet on something bigger. Diversity makes us stronger.

UNITED

Large

UBER

SPECIA

looking

PART8+252=B.52?.;B;.201=

Platform>;!5*=/8;6

D 8+252=B
D .52?.;B
.201=
8+252=B
D

we=!;2?,B7-!;8=.,=287!B6.7=<<7-277,25#.;?2,.<7=2=;><=

software8+252=B.52?.;B6958B..<-9=270 =8  .@ 'B 8/ '8;42706958B..700.6.7=

6958B.. .?.5896.7= *7- ".=.7=287
2?.;<2=B *7- 7,5><287
;2?.; *7- 8>;2.; '.55 .270

social.;=27 /,=8;< 6B 1?.  6=.;25 -?.;<. .//.,= 87 8>; +><27.<< /277,25 ,87-2=287 7- ;.<>5=< 8/ 89.;=287< (8> <18>5-,;./>55B ,87<2-.; =1. /8558@270 ;2<4< =80.=1.; @2=1 55 8/ =1. 8=1.; 27/8;6=287 ,87=27.- 27 =12< 77>5 ".98;= 87 8;627,5>-270=1.<.,=287<=2=5.-E#9.,258=.".0;-2708;@;-884270#==.6.7=<F7-E70.6.7=G2<,><<287*7-7*5B<2<8/27*7,2*587-2=287*7-".<>5=<8/9.;*=287<F*7-8>;/27*7,2*5<=*=.6.7=<*7-=1.;.5*=.-78=.<27,5>-.-.5<.@1.;.27=12<775".98;=878;6 7B8/=1./8558@270;2<4<,8>5-1?.7-?.;..//.,=878>;+><27.<</27*7,2*5,87-2=28789.;*=270;.<>5=<8;9;8<9.,=<*7-,8>5-,*><.=1.=;*-2709;2,.8/8>;,86687<=8,4=8-.,527.@12,1@85-,><.B8>=858<.558;9;=8/B8>;27?.<=6.7= >; +><27.<< /277,25 ,87-2=287 89.;=270 ;.<>5=< 8; 9;8<9.,=< ,8>5- 5<8 +. 1;6.- +B ;2<4< 7- >7,.;=27=2.< 78=,>;;.7=5B478@7=8>8;=1*=@.,;;.7=5B-878=+.52.?.;.6=.;2*5

BinaryBrain commented 4 weeks ago

It looks like this PDF font is encrypted. We'll have a look at it.
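
As an illustration of what a broken font mapping can look like, many lowercase letters in the sample above appear to come out exactly 55 code points too low (`8+252=B` reads as "obility", `.52?.;B` as "elivery"). The sketch below is only an observation about this particular sample, not how LlamaParse handles the file; the name `shift_decode` and the constant `OFFSET` are hypothetical, and uppercase letters and punctuation in the sample do not follow the same rule.

```python
# Assumption: in this sample, lowercase letters were emitted 55 code points
# too low ('*' stands for 'a', '+' for 'b', ..., 'C' for 'z').
OFFSET = 55
FIRST, LAST = ord("a") - OFFSET, ord("z") - OFFSET  # '*' .. 'C'


def shift_decode(garbled: str) -> str:
    """Shift characters in the affected range back up to lowercase letters."""
    return "".join(
        chr(ord(ch) + OFFSET) if FIRST <= ord(ch) <= LAST else ch
        for ch in garbled
    )


print(shift_decode("8+252=B"))   # obility
print(shift_decode(".52?.;B"))   # elivery
print(shift_decode(",87<2-.;"))  # consider
```

Because the affected range overlaps digits and the real letters A to C, this cannot be applied blindly to a whole page; it only shows that the text is remapped rather than lost.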

bAlemar commented 4 weeks ago

I have the same issue. LlamaParse started giving me bad parsing results. I re-parsed a PDF file that I had already parsed to Markdown with LlamaParse, and the result is now much worse. I am not sure what has changed in the API, but I am exploring alternative solutions.

hexapode commented 3 weeks ago

Thanks for sharing the PDF; we have identified the bug. The font in the PDF is buggy (try copying and pasting text from it), and we did not detect it as such. This specific issue will be fixed in the next release (this week), and I will let you know when it is available so you can test.
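
For anyone who wants to flag pages like this on their own side, a rough heuristic is to check how much of the extracted text is made of letters. This is a minimal sketch under stated assumptions, not the detection LlamaParse uses: `letter_ratio`, `looks_garbled`, and the `0.6` threshold are all hypothetical and would need tuning on real documents (tables of numbers, for instance, would also score low).

```python
def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 1.0  # treat empty text as "nothing suspicious"
    return sum(ch.isalpha() for ch in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """Flag text whose letter ratio falls below an arbitrary threshold."""
    return letter_ratio(text) < threshold


garbled = "8+252=B.52?.;B6958B..<-9=270"
clean = "We use MAPCs to assess the adoption of our platform"
print(looks_garbled(garbled), looks_garbled(clean))
```

A page flagged this way could then be re-checked by hand or sent through an OCR-based path instead of relying on the embedded font.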

hexapode commented 3 weeks ago

Hi! This is now mostly fixed in production. There is still an issue on some pages with Uber's custom font, which will be addressed in a future release, most likely at the end of this week or early next week. Some text will appear as:

```
ride or received a Delivery order on our platform at least once in a given month  averaged over each month in the quarter  While a
unique consumer can use multiple product offerings on our platform in a given month  that unique consumer is counted as only one
MAPC  We use MAPCs to assess the adoption of our platform and frequency of transactions  which are key factors in our penetration
of the countries in which we operate
```