run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Stacked table headers seen as separate headers #118

Open dcellison opened 8 months ago

dcellison commented 8 months ago

When a table in a PDF document has column headers made of two words stacked on top of each other, each word is parsed as a separate header. In the image below, "senior leadership" is a single column header.

[image: table-example]

In addition, the parser treats the table's title as part of the table itself, so the table ends up only two columns wide. The resulting Markdown from LlamaParse looks like this:

| |2022 Gender Representation by Level (global)1|
|---|---|
| |SENIOR|MIDDLE|LEADERSHIP|MANAGEMENT|PROFESSIONALS|TOTAL|
|Female|39.0%|51.3%|36.0%|44.0%|
|Male|61.0%|48.7%|64.0%|56.0%|

Two different Markdown renderers both produce this result:

[image: rendered]
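The mismatch in the LlamaParse output can also be checked mechanically. Below is a minimal sketch, not part of LlamaParse, that counts the cells in each row of the Markdown quoted above. The helper names `split_row` and `column_counts` are hypothetical, and the splitting is deliberately naive (it assumes no escaped pipes inside cells).

```python
from __future__ import annotations


def split_row(line: str) -> list[str]:
    """Split one pipe-delimited Markdown table row into stripped cells."""
    line = line.strip()
    # Drop the optional leading and trailing pipe before splitting.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def column_counts(table: str) -> list[int]:
    """Return the number of cells in every non-empty row of the table."""
    return [len(split_row(line)) for line in table.splitlines() if line.strip()]


# The Markdown exactly as quoted in this issue.
table = """\
| |2022 Gender Representation by Level (global)1|
|---|---|
| |SENIOR|MIDDLE|LEADERSHIP|MANAGEMENT|PROFESSIONALS|TOTAL|
|Female|39.0%|51.3%|36.0%|44.0%|
|Male|61.0%|48.7%|64.0%|56.0%|
"""

print(column_counts(table))  # [2, 2, 7, 5, 5]
```

The delimiter row declares two columns, the header words occupy seven cells, and each data row has five, which is consistent with the renderers being unable to line the rows up.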

With manual repairs, the Markdown renders like this:

[image: repaired]
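For this one table, a repair of that kind could be sketched as a post-processing step. This is only a heuristic under stated assumptions, not the repair actually used for the image above and not a LlamaParse feature: it assumes the stacked headers are the leftmost columns, that the top-line words come first in the parsed output, and that the intended headers pair up as "SENIOR LEADERSHIP" and "MIDDLE MANAGEMENT". The function name `merge_stacked_headers` is hypothetical.

```python
from __future__ import annotations


def merge_stacked_headers(tokens: list[str], n_columns: int) -> list[str]:
    """Merge top-line header words into the bottom-line words below them.

    Assumes the first (len(tokens) - n_columns) tokens are the top-line
    words of the leftmost columns, in column order.
    """
    extra = len(tokens) - n_columns
    if not 0 <= extra <= n_columns:
        raise ValueError("token count does not fit the stacked-header assumption")
    top, bottom = tokens[:extra], tokens[extra:]
    # zip stops after the stacked columns; the rest are single-line headers.
    return [f"{t} {b}" for t, b in zip(top, bottom)] + bottom[extra:]


def to_row(cells: list[str]) -> str:
    """Format a list of cells as one pipe-delimited Markdown row."""
    return "|" + "|".join(cells) + "|"


# Header words and data rows taken from the output quoted in this issue.
header_tokens = ["SENIOR", "MIDDLE", "LEADERSHIP", "MANAGEMENT", "PROFESSIONALS", "TOTAL"]
data_rows = [
    ["Female", "39.0%", "51.3%", "36.0%", "44.0%"],
    ["Male", "61.0%", "48.7%", "64.0%", "56.0%"],
]

# The first cell of each data row is the row label, not a data column.
headers = merge_stacked_headers(header_tokens, n_columns=len(data_rows[0]) - 1)

repaired = "\n".join(
    [to_row([" "] + headers), to_row(["---"] * (len(headers) + 1))]
    + [to_row(row) for row in data_rows]
)
print(repaired)
```

The table title is left out of the repaired table on purpose, so that every row has the same five cells.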

Any help with this would be appreciated!

dcellison commented 8 months ago

Here is the PDF in question. It's a single page extracted from the 2022 Coca-Cola business and sustainability report. I've been using it to test various PDF parsing tools. It typifies what I encounter when processing PDFs for embedding.