Headings - Githubissues

Sorry for the delayed reply! Burdoc already makes a best-effort attempt to do this. If you look in the JSON file it produces, there is a top-level entry called 'page_hierarchy', which contains all headings detected in the text. Unfortunately, Burdoc doesn't currently handle maths, so in your example a lot of formulas are extracted as 'h6' headings. Aside from the document title, though, it looks like all other headings are correctly identified as 'h5' entries.
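If it helps to see how the detected headings split across levels before filtering, a quick tally can show how many 'h6' entries (likely formulas) sit alongside the real 'h5' headings. This is only a rough sketch: it assumes 'page_hierarchy' maps each page to a list of heading entries with an 'assigned_heading' field, and both the `sample` data and the `count_heading_levels` helper are made up for illustration rather than being part of Burdoc.

```python
from collections import Counter


def count_heading_levels(extract):
    """Tally heading entries by assigned level (hypothetical helper, not a Burdoc API)."""
    counts = Counter()
    for page_headings in extract["page_hierarchy"].values():
        for h in page_headings:
            counts[h["assigned_heading"]] += 1
    return counts


# Hand-written sample in the assumed shape, not real Burdoc output
sample = {
    "page_hierarchy": {
        "0": [{"assigned_heading": "h5"}, {"assigned_heading": "h6"}],
        "1": [
            {"assigned_heading": "h6"},
            {"assigned_heading": "h5"},
            {"assigned_heading": "h6"},
        ],
    }
}

level_counts = count_heading_levels(sample)
print(level_counts)
```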

The following code shows how to get the page headings and then collect all content between the first heading found and the second. Hope this helps!

import json

# Load the JSON output produced by Burdoc
with open("investigation.txt") as f:
    extract = json.load(f)

# Collect all 'h5' headings from the document
headings = []
for page, page_headings in extract['page_hierarchy'].items():
    for h in page_headings:
        if h['assigned_heading'] == 'h5':
            headings.append(h)

# Start at the position of the first heading
first_content = []
page = headings[0]['page']
index = headings[0]['index'][0]

# Iterate over all items until we reach the next heading
while page < headings[1]['page'] or index < headings[1]['index'][0]:
    # Past the end of this page: move to the start of the next one
    if index >= len(extract['content'][page]):
        page += 1
        index = 0
        continue

    first_content.append(extract['content'][page][index])
    index += 1
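
If you want the content of every section rather than just the first, the same walk can be wrapped in a function and applied to any pair of headings. Again, this is a sketch under assumptions: `content_between` is a hypothetical helper, the `sample` data is hand-written, and it assumes 'content' can be indexed by an integer page number and that the first element of a heading's 'index' is its position within that page.

```python
def content_between(extract, start, end):
    """Collect items from heading `start` up to, but not including, heading `end`.

    Hypothetical helper; `start` and `end` are heading entries with
    'page' and 'index' fields in the assumed shape.
    """
    items = []
    page, index = start["page"], start["index"][0]
    end_page, end_index = end["page"], end["index"][0]

    while page < end_page or (page == end_page and index < end_index):
        page_items = extract["content"][page]
        # Past the end of this page: move to the start of the next one
        if index >= len(page_items):
            page += 1
            index = 0
            continue
        items.append(page_items[index])
        index += 1
    return items


# Hand-written sample in the assumed shape, not real Burdoc output
sample = {
    "content": [
        ["Title", "Intro heading", "para a"],
        ["para b", "Second heading", "para c"],
    ]
}
first = {"page": 0, "index": [1]}
second = {"page": 1, "index": [1]}

section = content_between(sample, first, second)
print(section)
```

With a list of headings in document order, `zip(headings, headings[1:])` gives each heading paired with the one that follows it, so the function can be called once per section.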

jennis0 / burdoc

Headings #9