jennis0 / burdoc

Advanced PDF parsing for python
MIT License

Extraction of headings from academic journal articles #11

Open MrUnknown789556 opened 1 year ago

MrUnknown789556 commented 1 year ago

I fully understand that Burdoc is currently a beta version under development. As it stands, Burdoc cannot reliably extract headings from academic journal articles. With some articles it does a great job; with others (most of them) the result is useless. It is certainly not reliable in the least. I hope this topic will be given priority as Burdoc moves beyond the beta version.

When Burdoc does not work properly, it either puts no headings where they should appear in the JSON file, or it returns all the headings mixed in with many other objects from the article.

I use either "h5" or "h6" to look for headings in the generated JSON file. I use the CLI.

h5 = extractBetween(JSONfile, '"h5", "block_text": "', '", "items":');
h6 = extractBetween(JSONfile, '"h6", "block_text": "', '", "items":');
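The same lookup can be done on the parsed JSON rather than on the raw text, which avoids depending on the exact spacing between keys. The following is a minimal Python sketch, not Burdoc's own API: it assumes only that heading blocks are JSON objects with a "type" key ("h1" to "h6") and a "block_text" key, as in the patterns above. The function name `collect_headings` and the sample data are made up for illustration, and nothing is assumed about where such blocks sit in the file, so the walk is recursive.

```python
import json
import re

# Matches the heading type names "h1" .. "h6".
HEADING_TYPE = re.compile(r"h[1-6]")


def collect_headings(node):
    """Return (type, block_text) pairs for every heading block, in document order."""
    found = []
    if isinstance(node, dict):
        block_type = node.get("type")
        if (
            isinstance(block_type, str)
            and HEADING_TYPE.fullmatch(block_type)
            and "block_text" in node
        ):
            found.append((block_type, node["block_text"]))
        # Keep descending: blocks may be nested at any depth.
        for value in node.values():
            found.extend(collect_headings(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_headings(item))
    return found


# Hypothetical sample; a real file would be read with json.load() instead.
sample = json.loads("""
[
  {"type": "h5", "block_text": "1. Introduction", "items": []},
  {"type": "paragraph", "block_text": "Body text.", "items": []},
  [{"type": "h6", "block_text": "2. Method", "items": []}]
]
""")

headings = collect_headings(sample)
for level, text in headings:
    print(level, text)
```

Because the level is returned together with the text, headings tagged h5 and headings tagged h6 end up in one list and can be filtered afterwards.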

Sometimes a heading appears only as an ordinary paragraph, as here: {"type": "paragraph", "block_text": "UDEC MODELLING OF P-WAVE PROPAGATION ACROSS JOINTS", "items": [{"spans": [{"text": "UDEC MODELLING OF P-WAVE", "font": {"name": "font", "font":.
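As a stopgap for this case, paragraph blocks can be screened with a simple text heuristic. The sketch below is a workaround under stated assumptions, not how Burdoc classifies headings: it flags "paragraph" blocks whose "block_text" is short and written entirely in capitals. The function names, the 12-word limit, and the second sample block are invented for illustration. Headings set in mixed case will be missed, so the font data in "spans" would be a better signal where it is available.

```python
def looks_like_heading(text, max_words=12):
    """Heuristic: short text in which every letter is uppercase."""
    letters = [ch for ch in text if ch.isalpha()]
    return (
        bool(letters)
        and len(text.split()) <= max_words
        and all(ch.isupper() for ch in letters)
    )


def paragraph_heading_candidates(node):
    """Collect block_text of paragraph blocks that look like headings."""
    found = []
    if isinstance(node, dict):
        text = node.get("block_text")
        if (
            node.get("type") == "paragraph"
            and isinstance(text, str)
            and looks_like_heading(text)
        ):
            found.append(text)
        # Blocks may be nested, so search every value.
        for value in node.values():
            found.extend(paragraph_heading_candidates(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(paragraph_heading_candidates(item))
    return found


# First block is taken from the example above; the second is made up.
sample = [
    {
        "type": "paragraph",
        "block_text": "UDEC MODELLING OF P-WAVE PROPAGATION ACROSS JOINTS",
        "items": [],
    },
    {
        "type": "paragraph",
        "block_text": "The joints were modelled as planar discontinuities.",
        "items": [],
    },
]

candidates = paragraph_heading_candidates(sample)
print(candidates)
```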

My impression is that Burdoc finds nearly all the headings in an article, but a program cannot locate them in the JSON file, because they are not stored systematically. Instead they are scattered through the JSON and "hidden" among ordinary text blocks.

I attach a log file from a test run on 816 randomly chosen academic articles. Some are older, some are more recent.

I also attach a few of the articles for which the extracted headings were not as expected, plus a single article ("Theoretical and Numerical Research on V-Cut Parameters and Auxiliary Cuthole Criterion in Tunnelling") for which the headings were extracted as expected. Further PDF articles listed in the log file can be supplied on request (frank230458@yahoo.dk).

Once this error in extracting headings from academic journal articles is fixed, all headings should be found in one place in the JSON, under either 'h5' or 'h6', rather than some under h5, others under h6, and others under quite different IDs.

2023.07.06 Test of TOC.txt
The effect of impact velocity and target thickness on ballistic performance of layered plates using Taguchi method-compressed.pdf
The effect of shell material and load coefficient on the expansion of shell driven by detonation.pdf
The effects of axial length on the fracture and fragmentation of expanding rings.pdf
The energy absorption enhancement in aramid fiber-reinforced poly(benzoxazine-co-urethane) composite armors under ballistic impacts.pdf
The influence of asymmetries in shaped charge performance.pdf
Theoretical and Numerical Research on V-Cut Parameters and Auxiliary Cuthole Criterion in Tunnelling.pdf