useblocks / libpdf

Extract structured data from PDFs
MIT License

Detect headlines in PDFs without outline #13

Open ubmarco opened 2 years ago

ubmarco commented 2 years ago

Look at https://github.com/ChrizH/pdfstructure - it implements a pdfminer-based solution that checks the font style of each line and looks for prepended chapter numbers. Here is an article about the solution: https://medium.com/@_chriz_/development-of-a-structure-aware-pdf-parser-7285f3fe41a9
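
To make the idea concrete, here is a minimal sketch of such a heuristic. It is not the pdfstructure implementation. The `Line` record, the `detect_headlines` function and the decision rule (larger than the dominant font size, or bold with a leading chapter number) are assumptions for illustration only. With pdfminer.six, the font size and bold flag could presumably be derived from the `size` and `fontname` attributes of each line's `LTChar` objects, but that extraction step is left out here.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted text line."""
    text: str
    font_size: float
    bold: bool = False


# Prepended chapter numbers such as "1", "2.3" or "4.1.2." followed by text.
CHAPTER_NUMBER = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def detect_headlines(lines):
    """Return the lines that look like headlines.

    Assumed rule: a line is a headline if its font is larger than the
    most common (body) font size, or if it is bold and starts with a
    chapter number.
    """
    if not lines:
        return []
    # Treat the most frequent font size as the body text size.
    sizes = Counter(round(line.font_size, 1) for line in lines)
    body_size = sizes.most_common(1)[0][0]

    headlines = []
    for line in lines:
        text = line.text.strip()
        if not text:
            continue
        larger = line.font_size > body_size
        numbered = bool(CHAPTER_NUMBER.match(text))
        if larger or (line.bold and numbered):
            headlines.append(line)
    return headlines
```

Requiring bold in addition to the chapter number is meant to avoid false positives on body sentences that happen to start with a digit. The thresholds would need tuning against real documents.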

ubmarco commented 2 years ago

This also looks worth testing: https://github.com/kermitt2/grobid