When a table header in a PDF consists of two words stacked on top of each other, the two words are parsed as separate headers. In the image below, "senior leadership" is a single column header.
The parser also treats the table's title as part of the table itself, so the table ends up only two columns wide. The resulting Markdown from LlamaParse looks like this:
Here is the PDF in question. It's a single page extracted from the 2022 Coca-Cola business and sustainability report. I've been using it to test various PDF parsing tools. It typifies what I encounter when processing PDFs for embedding.
Two different Markdown renderers give this as a result:
With manual repairs, the Markdown renders like this:
Any help with this would be appreciated!
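For reference, the manual repair I'm doing is roughly equivalent to the sketch below. It assumes the stacked header arrives as two consecutive table rows (ignoring the separator row), which may not match what the parser emits in every case. `merge_header_rows` is my own helper name, not part of LlamaParse, and the sample table uses placeholder values rather than the data from the report.

```python
def merge_header_rows(table_md: str, header_rows: int = 2) -> str:
    """Join the first `header_rows` rows of a pipe table, cell by cell,
    into a single header row. Requires Python 3.9+ (removeprefix/removesuffix)."""

    def cells(line: str) -> list[str]:
        # Drop exactly one leading and one trailing pipe, then split on pipes.
        inner = line.strip().removeprefix("|").removesuffix("|")
        return [c.strip() for c in inner.split("|")]

    def is_separator(line: str) -> bool:
        # A separator row contains only dashes and colons, e.g. | --- | :-: |
        return all(c and set(c) <= set("-:") for c in cells(line))

    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    rows = [cells(ln) for ln in lines if not is_separator(ln)]

    width = max(len(r) for r in rows)
    pad = lambda r: r + [""] * (width - len(r))

    head = [pad(r) for r in rows[:header_rows]]
    body = [pad(r) for r in rows[header_rows:]]

    # Stack the header fragments column by column, skipping empty cells.
    merged = [" ".join(part for part in col if part) for col in zip(*head)]

    out = [merged, ["---"] * width] + body
    return "\n".join("| " + " | ".join(r) + " |" for r in out)


# Placeholder table (hypothetical values, not taken from the report).
broken = """
| Region | Senior |
| --- | --- |
| | Leadership |
| x | y |
"""
print(merge_header_rows(broken))
```

This only addresses the split header. It does not handle the table title being pulled into the table, which would still need to be removed separately.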