Provide path for client to redact text from Tesseract OCR

Hey team, our content removal policies let people request that we remove their identifying information from the content we upload to our repositories (usually as a GDPR request). In the past, we've complied by blacking out the personal info, such as a name, in the attached PDF and deleting the same text from the separate OCR file. Now that we're running works through Tesseract on upload, can you help us determine how to remove discrete chunks of text from the Tesseract OCR output?

The solution can be as simple as "delete the attached work and reupload a redacted PDF, which Tesseract then processes, skipping the masked text." I just want to make sure that we know the best path and that it's guaranteed to work (meaning the text will truly be gone from the repository), since our policy requires us to remove information in response to reasonable or GDPR-related requests.
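To make the "truly gone" requirement concrete, here is a minimal Python sketch of the redact-then-verify idea applied to a plain-text copy of the OCR output. It is an illustration only, not how the repository currently works: the function names `redact_terms` and `find_leftovers` and the `[REDACTED]` mask are hypothetical, and the sketch assumes the OCR text is available as an ordinary string. It does not cover other places the text might be stored (for example, a search index or other derivative files), which would need to be checked separately.

```python
import re


def _term_pattern(term):
    # Allow any run of whitespace (including line breaks) between words,
    # because OCR output often wraps a name across two lines.
    words = [re.escape(word) for word in term.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def redact_terms(ocr_text, terms, mask="[REDACTED]"):
    """Return ocr_text with every occurrence of each term replaced by mask."""
    for term in terms:
        if not term.split():
            continue  # skip empty terms, which would otherwise match everywhere
        # A function replacement keeps the mask literal (no backslash escapes).
        ocr_text = _term_pattern(term).sub(lambda _match: mask, ocr_text)
    return ocr_text


def find_leftovers(ocr_text, terms):
    """Return the terms that can still be found in ocr_text."""
    return [
        term for term in terms
        if term.split() and _term_pattern(term).search(ocr_text)
    ]


# Hypothetical example: a name that the OCR split across a line break.
sample = "Submitted by Jane\nDoe, 12 Main St."
redacted = redact_terms(sample, ["Jane Doe"])
leftovers = find_leftovers(redacted, ["Jane Doe"])
```

Running `find_leftovers` against the stored text after a reupload would be one way to confirm that a redaction request has actually been satisfied, whichever removal path is chosen.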

Thanks!

scientist-softserv / adventist_knapsack #296