Update resolve-leetspeak/dictionary.ts

HatScripts commented 9 months ago

Type of change:

[ ] Refactor
[ ] Performance improvement
[x] New feature
[ ] Bug fix
[ ] Other (please describe):

Please describe the changes this PR makes and why it should be merged:

Updated the leetspeak dictionary to include:

'6' → 'b'
'8' → 'b'
'9' → 'g'
'!' → 'i'
'5' → 's'
'7' → 't'
'2' → 'z'

Some of these replacements might be overkill, so please let me know what you think. Further reading: https://en.wikipedia.org/wiki/Leet#Table_of_leet-speak_substitutes_for_normal_letters

Status:

[ ] I've added/modified unit tests relevant to my change / not needed
[ ] This PR contains breaking changes
[ ] This PR doesn't include changes to the code

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (ad51d19) 100.00% compared to head (8c77c76) 100.00%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #51 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 27 27 Lines 505 505 Branches 92 92 ========================================= Hits 505 505 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

HatScripts commented 9 months ago

Also, I noticed that sh|t isn't being censored. Not sure why. Terms like f@g, fu(k, c0ck, etc. are being correctly censored.

Edit: This is happening because | (vertical bar) is being replaced by l (lowercase L) in resolve-confusables: https://github.com/jo3-l/obscenity/blob/ad51d193ef23a52b685ee5e5c603b456ed9130b2/src/transformer/resolve-confusables/confusables.ts#L70

jo3-l commented 9 months ago

Sorry, but I'm not convinced these substitutions are clearly valuable, with the exception of perhaps ! -> i. When we add back whitespace stripping, for instance, 6 itch will be flagged as containing bitch which seems a rather egregious false positive. I'm sure there are many more examples in this vein, so would prefer to err on the side of caution here unless you have a compelling reason otherwise.

HatScripts commented 9 months ago

6 itch will be flagged as containing bitch

I could be wrong, but to me, this seems like an argument for including .addWhitelistedTerm('b itch'), not for excluding the leetspeak.

However, I'm having trouble even finding a word ending in b that would appear before the word itch. The only thing I can come up with for now is scab itch.

jo3-l commented 9 months ago

To be clear, my point is that the phrase 6 itch -- not including the b originally -- would be flagged as containing profanity, because 6 would be remapped to b. That seems a clear mistake to me.

HatScripts commented 9 months ago

Oh okay, that makes sense. Forgive my confusion.

HatScripts commented 9 months ago

Just to clarify, assuming resolve-leetspeak is ran before the check for whitelisted terms, wouldn't .addWhitelistedTerm('b itch') still be enough in the case of 6 itch to ignore the false positive?

jo3-l commented 9 months ago

assuming resolve-leetspeak is ran before the check for whitelisted terms

This is not the case; whitelisted term matching runs essentially on the original text by default (with only lowercasing applied.) I am not sure this is something we want to change.

jo3-l commented 8 months ago

Per my previous comments I'm not sure this is worth doing, so closing for now. Happy to discuss if you disagree, as always.

jo3-l / obscenity

Update resolve-leetspeak/dictionary.ts #51

Codecov Report