scientist-softserv / adventist_knapsack

Apache License 2.0

Provide path for client to redact text from Tesseract OCR #296

Open KatharineV opened 1 year ago

KatharineV commented 1 year ago

Hey team, our content removal policies allow people to request that we remove their identifying information from the content we upload to our repositories (this is usually a GDPR request). In the past, we've complied by blacking out the personal info, like a name, in the attached PDF and deleting the same text from the separate OCR file. Now that we're running works through Tesseract on upload, can you help us determine a way to remove discrete chunks of text from the Tesseract OCR output?

The solution can be something as simple as "delete the attached work and re-upload a redacted PDF, so that Tesseract runs on the new file and skips the masked text." I just want to make sure that we know the best path and that it's guaranteed to work (meaning the text will truly be gone from the repository), since we have a policy that requires us to remove information in response to reasonable or GDPR-related requests.
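To make "guaranteed to work" something we can check, here is a rough sketch of the kind of verification I have in mind. It does not call Tesseract or the repository; it assumes the OCR text for a work can be pulled out as a plain string, and the function name `find_leftover_terms` and the sample strings are hypothetical, only for illustration.

```python
import re


def find_leftover_terms(ocr_text, redacted_terms):
    """Return the redacted terms that still appear in the OCR text.

    Assumption: ocr_text is the plain-text OCR output for one work,
    obtained however the repository exposes it.
    """
    # Collapse whitespace and ignore case so that a name split across
    # an OCR line break is still detected.
    normalized = re.sub(r"\s+", " ", ocr_text).casefold()
    leftovers = []
    for term in redacted_terms:
        needle = re.sub(r"\s+", " ", term).casefold().strip()
        if needle and needle in normalized:
            leftovers.append(term)
    return leftovers


# Hypothetical OCR text before and after re-uploading a redacted PDF.
ocr_before = "Letter from Jane\nExample to the board"
ocr_after = "Letter from to the board"

print(find_leftover_terms(ocr_before, ["Jane Example"]))  # ['Jane Example']
print(find_leftover_terms(ocr_after, ["Jane Example"]))   # []
```

An empty list only tells us the exact terms are absent from the text that was checked. OCR misreadings of a name would not match, and any other place the repository keeps extracted text would need the same check.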

Thanks!