useblocks / libpdf

Extract structured data from PDFs
MIT License

Extract text from images using tesseract-ocr #14

Open ubmarco opened 2 years ago

ubmarco commented 2 years ago

To detect headlines (see issue #13), the font size and style (bold/italic) should also be extracted, not just the text. See https://stackoverflow.com/questions/39324626/get-font-size-in-python-with-tesseract-and-pyocr and https://github.com/tesseract-ocr/tesseract/issues/1074, in particular the comment https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-965063440.
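As a rough illustration of the headline idea, here is a minimal sketch that uses word bounding-box height as a stand-in for font size. It assumes the OCR result is a dict of parallel lists with the keys `text`, `block_num`, `par_num`, `line_num` and `height`, which is the shape `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` returns. The function name `headline_candidates` and the `ratio` threshold are hypothetical, not part of libpdf or tesseract, and bold/italic detection is not covered.

```python
from collections import defaultdict
from statistics import median

# Assumed input (requires pytesseract and a tesseract install):
#   data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)


def headline_candidates(data, ratio=1.3):
    """Return text lines whose word boxes are notably taller than the typical line.

    Box height is only a proxy for font size: it depends on the glyphs in
    each word (ascenders, descenders) and on the scan resolution.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries (page/block/line-level rows)
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((text, data["height"][i]))
    if not lines:
        return []

    # Use the tallest word box in each line as that line's height.
    line_heights = {key: max(h for _, h in words) for key, words in lines.items()}
    body_height = median(line_heights.values())

    # Keep lines that are clearly taller than the median line, in reading order.
    return [
        " ".join(t for t, _ in lines[key])
        for key in sorted(lines)
        if line_heights[key] > ratio * body_height
    ]
```

The threshold of 1.3 is an arbitrary starting point and would need tuning per document. A page consisting mostly of headings would also skew the median.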

ubmarco commented 2 years ago

https://github.com/ocrmypdf/OCRmyPDF