jennis0 / burdoc

Advanced PDF parsing for python
MIT License

Headings (Table of Content, TOC) #10

Open MrUnknown789556 opened 1 year ago

MrUnknown789556 commented 1 year ago

I am trying to extract the table of contents ("Introduction", ..., "References"), so I looked at the HTML file that Burdoc generated. Burdoc distinguishes headings from other items in the text fairly well. It extracted all of the named outline entries correctly, but it also extracted one additional item that is not part of the TOC: "Table 4".

I use the string "" to search in the generated html file for the TOC.
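For anyone who wants something sturdier than a plain string search, here is a minimal sketch that collects heading elements from an HTML file with Python's standard-library parser. It assumes the headings are rendered as `<h1>` to `<h6>` elements, which I have not confirmed for Burdoc's HTML output; `HeadingCollector` and `extract_headings` are hypothetical names for illustration.

```python
from html.parser import HTMLParser

# Assumption: headings appear as <h1>..<h6> in the generated HTML.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) for every heading element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []   # list of (level, text) tuples
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._level is None:
            self._level = int(tag[1])
            self._parts = []

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)


def extract_headings(html_text):
    """Return a list of (level, text) for all headings in html_text."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headings
```

Reading the generated file and passing its contents to `extract_headings` would give the candidate TOC entries, including any false positives such as "Table 4".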

It seems to make no difference whether I run Burdoc with the "--no-ml-tables" parameter or not.

Attachments: Screenshot 2023-06-22 135926; Table 4 (in HTML); Table 4 (in PDF); The.pdf

jennis0 commented 1 year ago

Ah, I think this one might be challenging as it's a false positive for one of the rules used to identify headings (a short bold piece of text directly preceding a standard paragraph and visually spaced from any prior text). Arguably it is a heading, albeit not one that'd be presented in a standard ToC.
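To make the rule concrete, it could be sketched roughly as below. This is an illustration only, not Burdoc's actual code: `TextBlock`, `looks_like_heading`, and the `max_words` and `min_gap` thresholds are all hypothetical, and "standard paragraph" is simplified to "not bold".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBlock:
    """Hypothetical layout block; y-coordinates increase down the page."""
    text: str
    bold: bool
    top: float
    bottom: float


def looks_like_heading(
    prev: Optional[TextBlock],
    block: TextBlock,
    nxt: Optional[TextBlock],
    max_words: int = 8,     # assumed cut-off for "short"
    min_gap: float = 10.0,  # assumed minimum vertical gap, in page units
) -> bool:
    # Must be a short, bold piece of text.
    if not block.bold or len(block.text.split()) > max_words:
        return False
    # Must directly precede a standard (here: non-bold) paragraph.
    if nxt is None or nxt.bold:
        return False
    # Must be visually spaced from any prior text.
    if prev is not None and block.top - prev.bottom < min_gap:
        return False
    return True
```

A bold "Table 4" label that sits clear of the preceding text and is followed by an ordinary paragraph passes all three checks, which is why it gets picked up as a heading.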

I wouldn't expect --no-ml-tables to change this. Turning off table-finding only means we don't try to identify tables in the text; the text they contain still goes through the main text parsing pipeline. Burdoc also doesn't yet identify captions associated with tables, so it wouldn't make a difference even if the table had been found.
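Until captions are identified, one possible stopgap is to filter the extracted headings yourself. The sketch below drops anything that starts with a caption-style label such as "Table 4" or "Figure 2". It is a post-processing idea, not a Burdoc feature; `CAPTION_LABEL` and `drop_caption_headings` are hypothetical names, and the pattern would also remove a genuine heading that happens to start the same way.

```python
import re

# Assumption: captions start with "Table", "Figure" or "Fig." followed by a number.
CAPTION_LABEL = re.compile(r"^(table|figure|fig\.?)\s*\d+\b", re.IGNORECASE)


def drop_caption_headings(headings):
    """Return the headings that do not look like table or figure captions."""
    return [h for h in headings if not CAPTION_LABEL.match(h.strip())]
```

For the case in this issue, applying it to the extracted heading texts would keep "Introduction" and "References" and drop "Table 4".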