kappapiana / anonymize

A script to change authorship to ODT and DOCX comments, redlines and whatnot.

Unwanted replacements in document text #19

Closed yamanq closed 2 years ago

yamanq commented 2 years ago

Currently, the replace function can also modify text outside the comment author field (comment text, document text), which is likely unwanted. The replace function should be extended with a helper that integrates the search regex into the actual replacement, so that only matches inside the targeted tag are changed. Ideally, replace would be generalized to handle multiple tag types rather than just comments. One step towards this would be converting the regex variables into a dictionary of regular expressions so that they can be accessed programmatically. This would also reduce code duplication across delete_initials, delete_dates, and replace. I imagine the function signature would look something like this: def replace(self, from_string, to_string, expression_type)
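A minimal sketch of what such a dictionary-driven replace could look like. The dictionary name, its keys, and the patterns are assumptions made for illustration, not the project's actual regexes, and the function is written at module level (without self) to keep the example self-contained. It relies only on the standard ODF dc:creator element and the OOXML w:author attribute.

```python
import re

# Hypothetical pattern table: keys and patterns are illustrative only.
# {name} is filled in with the escaped author name at call time.
PATTERNS = {
    "odt_creator": r"(<dc:creator>){name}(</dc:creator>)",
    "docx_author": r'(w:author="){name}(")',
}


def replace(text, from_string, to_string, expression_type):
    """Replace from_string with to_string only inside the chosen tag type."""
    pattern = PATTERNS[expression_type].format(name=re.escape(from_string))
    # A function is used as the replacement so that to_string is inserted
    # literally, even if it contains backslashes or group references.
    return re.sub(pattern, lambda m: m.group(1) + to_string + m.group(2), text)


sample = (
    "<dc:creator>Unknown Author</dc:creator>"
    "<text:p>Unknown Author wrote this</text:p>"
)
result = replace(sample, "Unknown Author", "Anonymized Author", "odt_creator")
```

With this shape, only the name inside dc:creator changes and the paragraph text is left alone. Adding a new tag type would mean adding one dictionary entry rather than a new function.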

kappapiana commented 2 years ago

A) Your idea to create a dictionary seems cool.

B) How can these regexps possibly target anything outside the tags? They are not very greedy, and so far they include tags that are not found elsewhere, as per my analysis. But if we find a way to be more selective, I'm fine with it.

yamanq commented 2 years ago

I think the initial search for comments is specific enough, but the replace function only takes the author name, without the surrounding tag, so it matches that name wherever it appears. Here is an example where I replace "Unknown Author" with "Anonymized Author": [two images]

This proposal aims at integrating the search regex into the replace function.
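The difference can be reproduced with a small, made-up ODT-like fragment. This is only a sketch: the fragment is invented for illustration, and anchoring with lookbehind and lookahead is one possible way to tie the replacement to the tag, not necessarily what the script does.

```python
import re

# Illustrative fragment, not taken from a real document.
xml = (
    "<office:annotation><dc:creator>Unknown Author</dc:creator>"
    "<text:p>Ask Unknown Author about this.</text:p></office:annotation>"
)

# Bare-name replacement: the comment text is rewritten too.
naive = xml.replace("Unknown Author", "Anonymized Author")

# Tag-anchored replacement: the name matches only when it sits directly
# between <dc:creator> and </dc:creator>.
anchored = re.sub(
    r"(?<=<dc:creator>)" + re.escape("Unknown Author") + r"(?=</dc:creator>)",
    "Anonymized Author",
    xml,
)
```

Here naive changes both occurrences of the name, while anchored changes only the one inside dc:creator and leaves the comment text as written.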

kappapiana commented 2 years ago

You are right. I thought I had tackled this issue months ago, but probably only in my dreams.