I've now spent quite some time testing the parsing of academic papers, with only mediocre success. In particular, the following issues occur:
A new page in the PDF always breaks the current paragraph, so sentences are often interrupted and continued in the next paragraph.
If the PDF has footer or header information, it is repeated over and over, and is sometimes even classified as headers. This footer text is then inserted into the body text at the page break, so the interrupted sentences/paragraphs (see above) cannot be reconstructed.
Footnotes are not systematically recognized, their numbering is lost in most cases, and footnotes sometimes appear in the body text.
Changing LLM instructions hasn’t helped.
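To make the first two issues concrete, here is a rough sketch of the kind of post-processing I would expect to help. It is only an illustration under assumptions: `pages` is a hypothetical input (one list of text lines per PDF page), not the parser's actual output format, and the function names and the `min_fraction` threshold are made up for this example. For brevity it treats each page as a single block of text and does not handle footnotes.

```python
from collections import Counter
import re


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages (likely running headers/footers).

    `pages` is a hypothetical structure: a list of pages, each a list of lines.
    """
    def norm(line):
        # Replace digits so "Page 3" and "Page 4" count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({norm(line) for line in page if line.strip()})

    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if norm(line) not in repeated] for page in pages]


def merge_across_pages(pages):
    """Join page texts, continuing a paragraph across a page break
    unless the text before the break ends with sentence punctuation."""
    paragraphs = []
    for page in pages:
        text = " ".join(line.strip() for line in page if line.strip())
        if not text:
            continue
        if paragraphs:
            previous = paragraphs[-1].rstrip("'\")]")
            if not previous.endswith((".", "!", "?", ":")):
                # Previous page stopped mid-sentence: continue the paragraph.
                paragraphs[-1] += " " + text
                continue
        paragraphs.append(text)
    return paragraphs


pages = [
    ["Journal of Examples  1", "The method is", "applied to"],
    ["Journal of Examples  2", "all samples.", "A new paragraph starts here."],
]
cleaned = strip_repeated_lines(pages)
print(merge_across_pages(cleaned))
```

The order matters here: the repeated header/footer lines have to be removed first, because otherwise they sit between the two halves of an interrupted sentence and the merge step cannot rejoin them.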