Request: Symbols that can represent multiple letters

rion18 commented 3 months ago

Description

I've tried a few combinations of EnglishTransformers, but I haven't been able to correctly censor words like sh*t or f*ck. Both words should be censored; in the first, * stands for an i, and in the second, * stands for a u. Is there a way to create a new transformer that maps one symbol to multiple letters, or to a regex?

Solution

I do not know how this can be implemented. Looking at the L33tspeak transformer, I can see there's a map per character:

    ['a', '@4'],
    ['c', '('],
    ['e', '3'],
    ['i', '1|'],
    ['o', '0'],
    ['s', '$'],

However, I don't know how it would work for a symbol that stands for multiple characters, where, for example, we could have:

    ['*', 'any_letter_or_vowel_etc.'],

Code of Conduct

[X] I agree to follow this project's Code of Conduct.

jo3-l commented 3 months ago

Allowing one character to map to multiple characters is intractable with Obscenity's current design, and I do not think it is something we would like to support--we would need to test patterns against all possible transformed strings instead of just one, potentially degrading performance significantly. (Consider the input text **** with * -> any of aeiou, for instance: there are 5^4 = 625 possible transformed strings. With adversarial inputs this could be disastrous.)
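
To make the blow-up concrete, here is a small standalone sketch that enumerates every string a wildcard could expand to. It is purely illustrative: `expandWildcards` is a hypothetical helper and not part of Obscenity's API.

```typescript
// Illustrative only: enumerate every expansion of a wildcard character.
// `expandWildcards` is a hypothetical helper, not an Obscenity function.
function expandWildcards(text: string, wildcard: string, candidates: string[]): string[] {
  let results: string[] = [""];
  for (const ch of text) {
    // A wildcard branches into every candidate; any other character is kept as-is.
    const options = ch === wildcard ? candidates : [ch];
    const next: string[] = [];
    for (const prefix of results) {
      for (const option of options) {
        next.push(prefix + option);
      }
    }
    results = next;
  }
  return results;
}

const vowels = ["a", "e", "i", "o", "u"];

// One wildcard: 5 candidate strings (shat, shet, shit, shot, shut).
const single = expandWildcards("sh*t", "*", vowels);
console.log(single.length); // 5

// Four wildcards: 5^4 = 625 candidate strings, each of which would need matching.
const quadruple = expandWildcards("****", "*", vowels);
console.log(quadruple.length); // 625
```

The number of candidates grows as (number of candidates per wildcard)^(number of wildcards), so a long run of asterisks in user input would multiply the matching work accordingly.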

The correct way to solve this issue is to either adjust the patterns (that is, add a pattern that matches directly on the text sh*t), or to strip out the * with a transformer. Some previous versions of Obscenity actually correctly identified a match in the input f*ck using the second approach (see the skipNonAlphabetic transformer, disabled by default due to #46.) It has always been my goal to eventually add the skipNonAlphabetic transformer back after fixing that issue, but I have not gotten to it yet.
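
As a rough, library-independent sketch of the two approaches: the helper names and the regular expressions below are assumptions made for illustration, and they are not Obscenity's pattern syntax or its skipNonAlphabetic implementation.

```typescript
// Approach 1 (assumed patterns): match the censored spelling literally,
// with the asterisk escaped so it is not treated as a regex quantifier.
const literalPatterns: RegExp[] = [/sh\*t/i, /f\*ck/i];

function matchesLiteral(text: string): boolean {
  return literalPatterns.some((p) => p.test(text));
}

// Approach 2: strip everything except ASCII letters, then match patterns
// that tolerate the missing vowel. Hypothetical helper, not the library's transformer.
function stripNonAlphabetic(text: string): string {
  return text.replace(/[^a-zA-Z]/g, "");
}

const strippedPatterns: RegExp[] = [/shi?t/i, /fu?ck/i];

function matchesAfterStripping(text: string): boolean {
  const stripped = stripNonAlphabetic(text);
  return strippedPatterns.some((p) => p.test(stripped));
}

console.log(matchesLiteral("sh*t"));          // true
console.log(matchesAfterStripping("f*ck"));   // true ("f*ck" becomes "fck")
console.log(matchesAfterStripping("hello"));  // false
console.log(matchesAfterStripping("push it")); // true ("push it" becomes "pushit")
```

The last call shows a general trade-off of the stripping approach in this naive form: removing spaces and punctuation can join neighbouring words and produce matches that span word boundaries.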

With these considerations in mind, I am inclined to close this specific request as wontfix, but I think the intent of your issue is actually already tracked in #46--so, to be clear, I do hope that eventually Obscenity's detection quality can be improved to catch the cases you mention, just not in the manner you propose. Does that sound reasonable to you?

jo3-l commented 3 months ago

For some context on why I have not yet fixed #46, the code dealing with transformations and matching is some of the more nasty code in this package, in part due to its age--I would have done things differently now compared to 3 years ago--and in part due to the complexity in mapping match positions in the transformed text back to the original text in a Unicode-aware way. (Working in both Unicode code points and UTF-16 code units depending on context makes this even nastier.) Consequently, for some time, this code was on rather shaky ground (as you observed in #71), and I was very reluctant to adjust it for fear of breaking it more.
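
To illustrate the position-mapping problem in isolation, here is a minimal sketch of one way to do it, assuming a per-code-point transform that may drop characters. All names here (`CharTransform`, `transformWithOffsets`, `findInOriginal`) are hypothetical and this is not how Obscenity's internals are written.

```typescript
// A transform receives a Unicode code point and returns the replacement
// code point, or undefined to drop the character. (Hypothetical type.)
type CharTransform = (codePoint: number) => number | undefined;

interface TransformedText {
  text: string;
  // For each UTF-16 code unit of `text`: the start and end (exclusive)
  // UTF-16 offsets, in the original string, of the character that produced it.
  starts: number[];
  ends: number[];
}

function transformWithOffsets(original: string, transform: CharTransform): TransformedText {
  let text = "";
  const starts: number[] = [];
  const ends: number[] = [];
  let offset = 0;
  for (const ch of original) { // for...of iterates by code point, not code unit
    const start = offset;
    offset += ch.length; // 1 or 2 UTF-16 code units
    const mapped = transform(ch.codePointAt(0)!);
    if (mapped === undefined) continue; // dropped character
    const out = String.fromCodePoint(mapped);
    for (let i = 0; i < out.length; i++) {
      starts.push(start);
      ends.push(offset);
    }
    text += out;
  }
  return { text, starts, ends };
}

// Find `needle` in the transformed text and report the span in the original.
function findInOriginal(original: string, needle: string, transform: CharTransform): [number, number] | undefined {
  if (needle.length === 0) return undefined;
  const t = transformWithOffsets(original, transform);
  const at = t.text.indexOf(needle);
  if (at === -1) return undefined;
  return [t.starts[at], t.ends[at + needle.length - 1]];
}

// Keep ASCII letters (lowercased) and drop everything else.
const lettersOnlyLower: CharTransform = (cp) => {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20;
  if (cp >= 0x61 && cp <= 0x7a) return cp;
  return undefined;
};

const input = "😀 F*ck!";
const span = findInOriginal(input, "fck", lettersOnlyLower);
console.log(span); // [3, 7]: the emoji occupies two UTF-16 code units
console.log(span ? input.slice(span[0], span[1]) : undefined); // "F*ck"
```

Even in this simplified form, the code has to iterate by code point while reporting offsets in UTF-16 code units so that the result is usable with `String.prototype.slice`.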

Recently, however, after your previous report I took the time to rework some of the code and fuzz tested it in https://github.com/jo3-l/obscenity/commit/25bd1db98591f4be75a9787aafcec4d850d0ddc0 and am now much more confident that things are as they should be. Addressing #46 should, I think, be considerably more straightforward after this, and it's possible we can do it in the next release.

rion18 commented 3 months ago

I did read #46, but didn't necessarily tie it to the use case presented here. I'll add asterisks to my word inputs as per your suggestion, since that will work for the moment.

Thanks a lot for your hard work!!

jo3-l commented 3 months ago

> I did read https://github.com/jo3-l/obscenity/issues/46, but not necessarily tied it to the use case presented here.

That's fair. The title of #46 is a little misleading at the moment; it's more of a tracking issue to get the skipNonAlphabetic transformer re-enabled by default since the original problem there was fixed.

jo3-l commented 3 months ago

For ease of tracking, I'm going to close this in favor of #46, which I just renamed to better reflect the current state of that issue. As discussed above, the suggestion ultimately presented there is a directly actionable way of solving the same problem in your original issue. Thanks!
