run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Reading order is messed up #45

Open mrtj opened 7 months ago

mrtj commented 7 months ago

Hello, the attached information leaflet has a somewhat complex multi-column layout, and LlamaParse gets the reading order completely wrong: xanax-uk.pdf

I would expect it to read the whole upper-left column first, starting with the title, then the second column to its right, and so on through all columns in the first row, then all columns from left to right in the second row. Instead, after the first line of the first column, the returned text jumps to the second column and then back to the first, which completely garbles the meaning of the text:

# Component Type: Leaflet

# Package leaflet: Information for the patient

If you are pregnant, think you might be pregnant now, are planning to become pregnant or if you are breast-feeding (see also the sections on ‘Pregnancy’ and ‘Breast-feeding’ for more information).

Do not take your tablets with an alcoholic drink.

## 250 microgram and 500 microgram Tablets

### Warnings and precautions

Talk to your doctor or pharmacist before taking Xanax if you:

- Have ever felt so depressed that you have thought about taking your own life.
- ...

Disiok commented 7 months ago

Thanks for raising this issue! We are actively improving this. Should be a lot better in the next few releases!

ah3243 commented 5 months ago

This is a big problem for me as well. If a PDF comes back with a bad or incorrect reading order, it really limits what you can do with it. Even allowing simple templates, or something more manual, or a classifier based on font, text block positioning, or standard heading titles, could really help make this useful.
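
To make the "text block positioning" idea concrete, here is a minimal sketch of a position-based reading-order heuristic. It is not how LlamaParse works internally, and it does not use any LlamaParse API. The `Block` type, the function names, and the `gap` and `tol` thresholds are all hypothetical, and it assumes you already have text blocks with bounding boxes from some PDF extractor, with `y` growing downward. It splits the page into horizontal bands, groups each band into columns by left edge, and reads each column top to bottom before moving right, which is the order described at the top of this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block with a bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_rows(blocks, gap=20.0):
    """Split blocks into horizontal bands separated by a vertical gap > `gap`."""
    rows = []
    bottom = None  # lowest edge seen so far in the current band
    for b in sorted(blocks, key=lambda b: b.y0):
        if bottom is not None and b.y0 - bottom <= gap:
            rows[-1].append(b)
            bottom = max(bottom, b.y1)
        else:
            rows.append([b])
            bottom = b.y1
    return rows


def group_columns(blocks, tol=10.0):
    """Group blocks whose left edge is within `tol` of the column's first block."""
    columns = []
    for b in sorted(blocks, key=lambda b: b.x0):
        if columns and b.x0 - columns[-1][0].x0 <= tol:
            columns[-1].append(b)
        else:
            columns.append([b])
    return columns


def reading_order(blocks, gap=20.0, tol=10.0):
    """Order blocks band by band, column by column, top to bottom."""
    ordered = []
    for row in split_rows(blocks, gap):
        for column in group_columns(row, tol):
            ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# Toy page: two columns in the first band, two in the second (assumed layout).
page = [
    Block("Right body", 120, 0, 220, 200),
    Block("Row 2 right", 120, 260, 220, 400),
    Block("Title", 0, 0, 100, 20),
    Block("Row 2 left", 0, 260, 100, 400),
    Block("Left body", 0, 25, 100, 200),
]
print([b.text for b in reading_order(page)])
```

A plain sort by top edge and then left edge would put "Right body" between "Title" and "Left body" on this toy page, which resembles the jump reported above. The fixed thresholds are the weak point of this sketch: real leaflets with indented lists, headings that span columns, or uneven gutters would need something more robust, such as the templates or classifier suggested above.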