globalgov / manypkgs

Support for creating new manyverse packages
https://globalgov.github.io/manypkgs/
GNU Affero General Public License v3.0

OCR treaty pdfs #18

Closed: jhollway closed this issue 2 years ago

jhollway commented 3 years ago

Consider image binarization for removing grayscale elements from scanned text for improved optical character recognition.

How does quanteda do this?
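For reference, a minimal sketch of what binarization means here. It uses Otsu's method, which is one common way to pick a global threshold. Nothing in this thread says that quanteda, {pdftools}, or {tesseract} do it this way. It is written in plain Python purely for illustration. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a list of rows of 8-bit grey values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels <= t are treated as one class (ink), pixels > t as the other (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels with value <= t
    sum_dark = 0    # sum of grey values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet, nothing to split
        w_light = total - w_dark
        if w_light == 0:
            break               # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(rows):
    """Map every pixel to pure black (0) or pure white (255)."""
    flat = [p for row in rows for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in rows]


if __name__ == "__main__":
    # Toy "scan": dark ink (20, 30), a grey smudge (180), light paper (240-250).
    scan = [[250, 240, 30],
            [180, 20, 245]]
    print(binarize(scan))
```

In the toy scan the grey smudge (180) ends up on the white side of the threshold, which is the "removing grayscale elements" effect described above. A single global threshold is the simplest option and is an assumption here; unevenly lit scans may need a locally adaptive one.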

jhollway commented 3 years ago

{pdftools}, {tesseract}, or both?

henriquesposito commented 2 years ago

@jhollway do you have any specific examples of datasets you want to incorporate whose treaty texts are in PDF format? I will start working on a function to OCR PDF texts. Thank you.

jhollway commented 2 years ago

ECOLEX has PDFs.