jennis0 / burdoc

Advanced PDF parsing for python
MIT License

Headings #9

Open MrUnknown789556 opened 1 year ago

MrUnknown789556 commented 1 year ago

If headings (and subheadings) are defined internally in the PDF file, they can be easily distinguished and extracted from the JSON file that Burdoc generates. If there are no such internal headings (and subheadings), all the headings appear in the JSON file mixed together with the body text, tables, etc.

How can I distinguish and extract ONLY the headings (and subheadings) from the generated JSON file (for instance, using a regex) when the PDF file defines no headings internally, so none are listed in the "toc" section?

Investigation.pdf Investigation.txt

jennis0 commented 1 year ago

Sorry for the delayed reply! Burdoc already makes a best-effort attempt to do this. If you look in the produced JSON file, there is a top-level entry called 'page_hierarchy', which contains all headings detected in the text. Unfortunately, Burdoc currently doesn't handle maths, so in your example a lot of formulas are extracted as 'h6' headings, but it looks like it correctly identifies all other headings, aside from the document title, as 'h5' entries.

The following code shows how to get the page headings and then collect all content between the first heading found and the second. Hope this helps!

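import json

# Load the JSON that Burdoc produced (attached above as Investigation.txt)
with open("Investigation.txt") as f:
    extract = json.load(f)

# Get all 'h5' headings from the document
headings = []
for page, page_headings in extract['page_hierarchy'].items():
    for h in page_headings:
        if h['assigned_heading'] == 'h5':
            headings.append(h)

first_content = []
page = headings[0]['page']
index = headings[0]['index'][0]

# Iterate over all items until we reach the next heading.
# JSON object keys are strings, so pages are looked up with str(page);
# this assumes 'content' is keyed by page number like 'page_hierarchy'.
while page < headings[1]['page'] or index < headings[1]['index'][0]:
    page_items = extract['content'].get(str(page), [])
    if index >= len(page_items):
        # Ran off the end of this page: move to the start of the next one
        page += 1
        index = 0
        continue

    first_content.append(page_items[index])
    index += 1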
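
The same idea can be extended from the first heading to every heading. The following is only a minimal sketch, not part of Burdoc: `collect_headings` and `split_by_headings` are hypothetical helper names, and the `sample` dictionary is made-up data that mimics the fields used above ('page_hierarchy', 'content', 'page', 'index', 'assigned_heading'). It assumes 'content' is keyed by page number in the same way as 'page_hierarchy', and it uses plain strings in place of the real content items. Restricting the levels to 'h5' is one way to skip the formulas that are picked up as 'h6'.

```python
from typing import Any


def collect_headings(extract: dict[str, Any], levels: set[str]) -> list[dict[str, Any]]:
    """Return hierarchy entries whose 'assigned_heading' is in `levels`."""
    found = []
    for page_headings in extract["page_hierarchy"].values():
        found.extend(h for h in page_headings if h["assigned_heading"] in levels)
    # Sort by (page, position on page) so the result is in reading order.
    return sorted(found, key=lambda h: (h["page"], h["index"][0]))


def split_by_headings(extract: dict[str, Any], headings: list[dict[str, Any]]) -> list[list[Any]]:
    """For each heading, collect the content items up to the next heading."""
    # Flatten all pages into one stream of ((page, index), item) pairs.
    stream = []
    for page_key in sorted(extract["content"], key=int):
        for index, item in enumerate(extract["content"][page_key]):
            stream.append(((int(page_key), index), item))

    starts = [(h["page"], h["index"][0]) for h in headings]
    sections = []
    for i, start in enumerate(starts):
        # The last heading has no successor, so its section runs to the end.
        end = starts[i + 1] if i + 1 < len(starts) else None
        sections.append(
            [item for pos, item in stream if pos >= start and (end is None or pos < end)]
        )
    return sections


# Made-up data for illustration only; real content items are not plain strings.
sample = {
    "content": {
        "0": ["Title", "1 Introduction", "Intro text", "x = y + 1"],
        "1": ["More intro text", "2 Method", "Method text"],
    },
    "page_hierarchy": {
        "0": [
            {"page": 0, "index": [1], "assigned_heading": "h5"},
            {"page": 0, "index": [3], "assigned_heading": "h6"},  # a formula
        ],
        "1": [{"page": 1, "index": [1], "assigned_heading": "h5"}],
    },
}

headings = collect_headings(sample, {"h5"})
sections = split_by_headings(sample, headings)
for section in sections:
    print(section)
```

Each section starts with the heading item itself, because the slice begins at the heading's own index. To use this on a real file, replace `sample` with the dictionary returned by `json.load`.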