slovak-egov / CRZ-scraper

Web scraping and filtering code for the Slovak contract database, crz.gov.sk. The code downloads the XML databases, builds a CSV database of contracts, filters them, downloads the files, and extracts and cleans up tables with MD rates.

Extraction of tables from scanned documents #3

Open mtihanyi opened 2 years ago

mtihanyi commented 2 years ago

Tables are extracted from PDF files with camelot, a well-known Python library. However, camelot requires the processed PDF to contain a text layer (machine-readable text, not just a scanned image of text). It looks for certain descriptors that mark a table, and without them it cannot extract one. The code also produces TXT files from both text-based and scanned PDF files, but that conversion loses the metadata, descriptors, and other fields camelot needs to work.

jakubox commented 2 years ago

How about using OCR such as Tesseract instead of camelot? Building the table from the OCR output takes a bit of work, but I don't see a way around it. Or rebuild camelot into a camelot 2.0 that can handle scanned tables ;)

mtihanyi commented 2 years ago

Tesseract is already used in the OCR step before table extraction. However, I don't see how to locate a table in the TXT output (that is, at which indices it sits). As for camelot, I do not intend to modify it. I have found several articles on this problem, so I will look into it in the upcoming months.
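
One possible direction for the "where is the table" problem, sketched under assumptions: plain TXT output has no positions, but Tesseract can also report a bounding box for every recognized word (its TSV output, or `pytesseract.image_to_data` if the pytesseract wrapper is used, which is an assumption about this project). With boxes available, words can be grouped into rows by their vertical position and split into cells at wide horizontal gaps. The function names, the tolerances `y_tol` and `x_gap`, and the sample data below are hypothetical and would need tuning on real contracts; this is not how CRZ-scraper or camelot work.

```python
from typing import Dict, List

# One OCR word with its bounding box, in pixels. The keys mirror the
# column names of Tesseract's TSV output: text, left, top, width, height.
Word = Dict[str, object]


def group_rows(words: List[Word], y_tol: int = 10) -> List[List[Word]]:
    """Group words into rows: a word joins the current row when its top
    edge is within y_tol pixels of the first word of that row."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w["top"], w["left"])):
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, order words left to right.
    return [sorted(r, key=lambda w: w["left"]) for r in rows]


def split_cells(row: List[Word], x_gap: int = 40) -> List[str]:
    """Split one row into cells wherever the horizontal gap between
    consecutive words exceeds x_gap pixels."""
    cells = [[str(row[0]["text"])]]
    prev_right = row[0]["left"] + row[0]["width"]
    for w in row[1:]:
        if w["left"] - prev_right > x_gap:
            cells.append([str(w["text"])])   # wide gap: start a new cell
        else:
            cells[-1].append(str(w["text"]))  # small gap: same cell
        prev_right = w["left"] + w["width"]
    return [" ".join(c) for c in cells]


def words_to_table(words: List[Word], y_tol: int = 10, x_gap: int = 40) -> List[List[str]]:
    """Turn OCR words with bounding boxes into a list of rows of cell strings."""
    words = [w for w in words if str(w["text"]).strip()]  # drop empty OCR tokens
    return [split_cells(r, x_gap) for r in group_rows(words, y_tol)]


# Hypothetical OCR output for a two-row, two-column table.
sample_words = [
    {"text": "Role", "left": 50, "top": 100, "width": 60, "height": 20},
    {"text": "MD", "left": 300, "top": 102, "width": 30, "height": 20},
    {"text": "rate", "left": 340, "top": 101, "width": 50, "height": 20},
    {"text": "Senior", "left": 50, "top": 140, "width": 80, "height": 20},
    {"text": "developer", "left": 140, "top": 141, "width": 110, "height": 20},
    {"text": "500", "left": 300, "top": 139, "width": 40, "height": 20},
]

table = words_to_table(sample_words)
```

For the sample above, `table` is `[["Role", "MD rate"], ["Senior developer", "500"]]`. Fixed pixel thresholds are the weakest part of the sketch: skewed scans or varying font sizes would probably need tolerances derived from the word heights instead.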