inclusivenaming / website

Website for the Inclusive Naming Initiative
https://inclusivenaming.org/
Creative Commons Attribution 4.0 International

Ponder some solutions to automate searching for terms in content #169

Open waynebeaton opened 5 months ago

It would be handy to have a regular expression that one could use to run a quick scan on content to ensure that words in a particular set of tiers do not appear. I maintain, for example, some documentation in AsciiDoc that really lends itself well to this sort of search.

I cobbled this expression together based on the contents of the "term" field in each wordlist file:

$ grep -Pi "\b(?:black[\-\s]?box|Blackout|disable|fellow|master\s?mind|white\s?box|white[\-\s]?label|test1|cripple|master|slave|abort|blackhat|whitehat|Tribe|white[\-\s]?list|sanity(?:\-|\s)check|hallucinate|man\-in\-the\-middle|Segregate)\b" -R .

You'll notice in my expression that I accounted for some variations ("sanity check" and "sanity-check"; "black box", "black-box", and "blackbox"; ...), and broke apart some of the combinations ("whitehat-blackhat" became "whitehat" and "blackhat"). I didn't get them all, and there may be some errors (I haven't paid any attention to tiers, for example).

In typical fashion, I'm probably overthinking this... Since the lists are dynamic and I expect will change over time, I'm thinking that the expression should be generated automatically based on the wordlist files in /content/word-lists. It should be relatively straightforward to leverage Hugo to build an expression from this data, or from the content in the JSON word-list.
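To make the idea concrete, here is a rough sketch of the generation step in Python rather than Hugo. It is only an illustration under assumptions: the function names `term_to_pattern` and `build_pattern` are hypothetical, and the input is a plain list of term strings, because the exact shape of the JSON word-list is not described here. Each space or hyphen inside a term is turned into an optional separator, which reproduces the "black box" / "black-box" / "blackbox" handling from the hand-written expression.

```python
import re


def term_to_pattern(term: str) -> str:
    """Turn one term into a regex fragment.

    Every run of spaces or hyphens inside the term becomes an optional
    separator, so "black box" also matches "black-box" and "blackbox".
    """
    parts = [re.escape(p) for p in re.split(r"[-\s]+", term.strip()) if p]
    return r"[-\s]?".join(parts)


def build_pattern(terms):
    """Compile a single case-insensitive, word-bounded regex for all terms."""
    # A set removes duplicate fragments; sorting keeps the output stable.
    fragments = sorted(
        {term_to_pattern(t) for t in terms if t.strip()},
        key=lambda f: (-len(f), f),
    )
    if not fragments:
        raise ValueError("no terms given")
    return re.compile(r"\b(?:" + "|".join(fragments) + r")\b", re.IGNORECASE)


if __name__ == "__main__":
    # Illustrative terms only; a real run would read them from the word lists.
    pattern = build_pattern(["black box", "sanity check", "abort"])
    print(pattern.pattern)
```

The printed expression could be pasted into `grep -Pi`. Note that a term written without a separator (for example "whitelist") only matches that exact spelling in this sketch, which is one more reason the way terms are captured matters.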

Having some consistency in the way that terms are captured would make this a lot more useful. The "master-slave" and "whitehat-blackhat" entries don't lend themselves well to the automation (I understand why combining them makes sense). Perhaps adding some version of a variations field could provide a simple solution.
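As a sketch of how a variations field might be consumed: the `variations` key and the entry layout below are hypothetical, invented for illustration and not part of the current word-list files, and `search_terms` and `compile_terms` are made-up helper names. When an entry lists variations, those are searched instead of the combined term; otherwise the term itself is used.

```python
import re

# Hypothetical entries: the "variations" key is an assumption, not an
# existing field in the word-list files.
ENTRIES = [
    {"term": "master-slave", "variations": ["master", "slave"]},
    {"term": "whitehat-blackhat", "variations": ["whitehat", "blackhat"]},
    {"term": "abort"},
]


def search_terms(entries):
    """Flatten entries into searchable terms, preferring "variations"."""
    terms, seen = [], set()
    for entry in entries:
        for term in entry.get("variations") or [entry["term"]]:
            key = term.lower()
            if key not in seen:  # drop case-insensitive duplicates
                seen.add(key)
                terms.append(term)
    return terms


def compile_terms(terms):
    """Compile the terms into one case-insensitive, word-bounded regex."""
    alternation = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


if __name__ == "__main__":
    print(search_terms(ENTRIES))
```

With this layout the combined entries can stay combined on the website while the scan still looks for each word on its own.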