Hey team, our content removal policies let people request that we remove their identifying information from the content we upload to our repositories (this is usually a GDPR request). In the past, we've complied by blacking out the personal info, like a name, in the attached PDF and deleting the same text from the separate OCR file. Now that we're running works through Tesseract on upload, can you help us figure out how to remove discrete chunks of text from the Tesseract OCR?
The solution can be something as simple as "delete the attached work and re-upload a redacted PDF, so that Tesseract runs again and skips the masked text." I just want to make sure we know the best path and that it's guaranteed to work (meaning the text will truly be gone from the repository), since our policy requires us to remove information in response to reasonable or GDPR-related requests.
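In case it helps, here's a rough sketch of the kind of check I have in mind for confirming the text is gone after a re-upload. It assumes we can pull the Tesseract output for a work as a plain string. The function names and sample text are made up for illustration and aren't anything in our codebase.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leftover_terms(ocr_text: str, redacted_terms: list[str]) -> list[str]:
    """Return the redacted terms that still appear in the OCR text.

    Hypothetical helper: ocr_text is assumed to be the full Tesseract
    output for one work, already loaded as a string.
    """
    haystack = _normalize(ocr_text)
    leftovers = []
    for term in redacted_terms:
        needle = _normalize(term)
        # Skip empty terms; an empty string would match everything.
        if needle and needle in haystack:
            leftovers.append(term)
    return leftovers


if __name__ == "__main__":
    # Made-up sample text standing in for OCR output before and after redaction.
    ocr_before = "Interview with Jane\nDoe, recorded 1998"
    ocr_after = "Interview with , recorded 1998"
    print(find_leftover_terms(ocr_before, ["Jane Doe"]))  # ['Jane Doe']
    print(find_leftover_terms(ocr_after, ["Jane Doe"]))   # []
```

This only catches exact matches (ignoring case and whitespace), so a name that Tesseract misread would slip through. It's meant as a sanity check on whatever process you recommend, not a replacement for it.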
Thanks!