Open mtihanyi opened 3 years ago
How about using an OCR engine such as Tesseract instead of camelot? Building the table from the OCR output takes a bit of work, but I don't see a way around it. Or rebuild camelot into a camelot 2.0 that can handle scanned tables ;)
Tesseract is already used in the OCR step that runs before table extraction. However, I don't see how to locate a table in the TXT output (what would the indices be?). As for camelot, I don't intend to modify it. I have found several articles on this problem, so I will look into it in the upcoming months.
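One possible direction, as a rough sketch only: plain TXT output drops the word positions, but Tesseract can also report a bounding box per word (for example through `pytesseract.image_to_data` with `output_type=Output.DICT`, which returns parallel lists under keys such as `text`, `left`, `top`, `width` and `height`). With those boxes, words can be grouped into rows by vertical position and into cells by horizontal gaps. The function names, the tolerances and the sample boxes below are assumptions made for illustration; nothing here is part of camelot or of the existing code, and it does not detect where a table starts or ends on the page.

```python
from statistics import median


def boxes_from_tesseract(data):
    """Turn the dict-of-lists layout (as returned by image_to_data with
    Output.DICT) into one dict per word box. Hypothetical helper."""
    keys = ("text", "left", "top", "width", "height")
    return [dict(zip(keys, vals)) for vals in zip(*(data[k] for k in keys))]


def group_words_into_table(words, row_tol=None, col_gap=None):
    """Group word boxes into rows, then into cells. Returns a list of rows,
    each a list of cell strings. Tolerances are guesses, in pixels."""
    # Drop boxes without text; Tesseract also reports empty layout boxes.
    words = [w for w in words if w["text"].strip()]
    if not words:
        return []

    typical_height = median(w["height"] for w in words)
    row_tol = typical_height * 0.5 if row_tol is None else row_tol
    col_gap = typical_height * 1.5 if col_gap is None else col_gap

    def y_center(w):
        return w["top"] + w["height"] / 2

    # 1. Words whose vertical centres are close together share a row.
    rows = []
    for w in sorted(words, key=y_center):
        if rows and abs(y_center(w) - y_center(rows[-1][-1])) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])

    # 2. Inside a row, a wide horizontal gap starts a new cell;
    #    a narrow gap is treated as a space within the same cell.
    table = []
    for row in rows:
        row.sort(key=lambda w: w["left"])
        cells = [row[0]["text"]]
        prev_right = row[0]["left"] + row[0]["width"]
        for w in row[1:]:
            if w["left"] - prev_right > col_gap:
                cells.append(w["text"])
            else:
                cells[-1] += " " + w["text"]
            prev_right = w["left"] + w["width"]
        table.append(cells)
    return table


# Made-up boxes standing in for real OCR output.
sample = {
    "text":   ["Item", "Unit", "price", "", "Bolt", "0.25"],
    "left":   [10, 200, 240, 0, 10, 200],
    "top":    [10, 11, 11, 0, 40, 41],
    "width":  [40, 35, 45, 300, 38, 36],
    "height": [12, 12, 12, 60, 12, 12],
}
for row in group_words_into_table(boxes_from_tesseract(sample)):
    print(row)
```

With real scans, the dict would come from something like `pytesseract.image_to_data(Image.open("page.png"), output_type=pytesseract.Output.DICT)`, where `page.png` is a placeholder name. Tables with empty cells would additionally need cells aligned to shared column positions, which this sketch does not attempt.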
The engine used to extract tables from PDF files is camelot, a well-known Python library. However, camelot requires that the processed PDF contain actual text ("computer" text, not just a scanned image of text). It searches for certain descriptors that designate a table, which is a prerequisite for extracting one. The code also produces TXT files from both textual and scanned PDF files, but this loses the metadata, descriptors, and other fields that camelot needs to work.