jhollway closed this issue 2 years ago
{pdftools}, {tesseract}, or both?
@jhollway do you have any specific examples of datasets you want to incorporate whose treaty texts are in PDF format? I will start working on a function to OCR PDF texts. Thank you.
ECOLEX has PDFs.
Consider image binarization for removing grayscale elements from scanned text for improved optical character recognition.
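To make the binarization idea concrete, here is a minimal sketch of Otsu thresholding written in plain Python. It is only an illustration: the thread does not say which binarization method is intended, Otsu's method is my assumption, and the function names `otsu_threshold` and `binarize` are hypothetical. A page is represented as a list of rows of 8-bit grayscale values (0 = black, 255 = white).

```python
def otsu_threshold(pixels):
    """Pick the 8-bit threshold that maximizes between-class variance (Otsu)."""
    # Histogram of grayscale values 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # no light pixels left
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image):
    """Map every pixel to pure black (0) or pure white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


# Hypothetical 2x3 scan: dark ink (about 20-30) on a light page (about 200-220).
page = [[20, 30, 200],
        [25, 210, 220]]
clean = binarize(page)
```

The grayscale smudges collapse to either ink or paper, which is the kind of input OCR engines generally handle better. In practice an image library would do this step rather than hand-written loops.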
How does quanteda do this?