Closed rion18 closed 3 months ago
Allowing one character to map to multiple characters is intractable with Obscenity's current design, and I do not think it is something we would like to support--we would need to test patterns against all possible transformed strings instead of just one, potentially degrading performance significantly. (Consider the input text **** with * -> any of aeiou, for instance: there are 5^4 = 625 possible transformed strings. With adversarial inputs this could be disastrous.)
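To make the blow-up concrete, here is a minimal sketch that enumerates every string a wildcard input could stand for. The helper `expandWildcards` is hypothetical and is not part of Obscenity; it only illustrates why testing patterns against every expansion scales poorly.

```typescript
// Hypothetical helper (not part of Obscenity): expands every occurrence of
// `wildcard` in `text` into each of the candidate replacement characters.
function expandWildcards(text: string, wildcard: string, candidates: string[]): string[] {
  let results: string[] = [""];
  // Array.from iterates by code point, so astral characters stay intact.
  for (const ch of Array.from(text)) {
    const options = ch === wildcard ? candidates : [ch];
    const next: string[] = [];
    for (const prefix of results) {
      for (const option of options) {
        next.push(prefix + option);
      }
    }
    results = next;
  }
  return results;
}

const vowels = ["a", "e", "i", "o", "u"];
console.log(expandWildcards("sh*t", "*", vowels).length); // 5
console.log(expandWildcards("****", "*", vowels).length); // 625
```

Each additional wildcard multiplies the number of candidates by five, so an input of n asterisks yields 5^n strings that would each have to be matched.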
The correct way to solve this issue is to either adjust the patterns (that is, add a pattern that matches directly on the text sh*t), or to strip out the * with a transformer. Some previous versions of Obscenity actually correctly identified a match in the input f*ck using the second approach (see the skipNonAlphabetic transformer, disabled by default due to #46.) It has always been my goal to eventually add the skipNonAlphabetic transformer back after fixing that issue, but I have not gotten to it yet.
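As a rough illustration of the two approaches, the sketch below uses plain regular expressions rather than Obscenity's own pattern syntax or transformer API. The function `stripNonAlphabetic` is a hypothetical stand-in for a skip-non-alphabetic transformer, and the pattern lists are assumptions made up for this example; in particular, it assumes the post-stripping pattern tolerates the missing vowel.

```typescript
// Hypothetical stand-in for a "skip non-alphabetic characters" transformer.
// This is not Obscenity's implementation.
function stripNonAlphabetic(text: string): string {
  return text.replace(/[^a-zA-Z]/g, "");
}

// Approach 1: a pattern written against the obfuscated text itself.
const literalPatterns: RegExp[] = [/sh\*t/i];

// Approach 2: patterns applied after stripping. The vowel is optional here
// because "f*ck" becomes "fck" once the asterisk is removed.
const strippedPatterns: RegExp[] = [/fu?ck/i];

function containsMatch(text: string): boolean {
  const stripped = stripNonAlphabetic(text);
  return (
    literalPatterns.some((p) => p.test(text)) ||
    strippedPatterns.some((p) => p.test(stripped))
  );
}

console.log(containsMatch("sh*t"));  // true (matched by the literal pattern)
console.log(containsMatch("f*ck"));  // true (matched after stripping)
console.log(containsMatch("hello")); // false
```

Either way, only one string per input is tested against each pattern, which avoids the combinatorial cost of expanding the asterisk.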
With these considerations in mind, I am inclined to close this specific request as wontfix, but I think the intent of your issue is actually already tracked in #46--so, to be clear, I do hope that eventually Obscenity's detection quality can be improved to catch the cases you mention, just not in the manner you propose. Does that sound reasonable to you?
For some context on why I have not yet fixed #46, the code dealing with transformations and matching is some of the more nasty code in this package, in part due to its age--I would have done things differently now compared to 3 years ago--and in part due to the complexity in mapping match positions in the transformed text back to the original text in a Unicode-aware way. (Working in both Unicode code points and UTF-16 code units depending on context makes this even nastier.) Consequently, for some time, this code was on rather shaky ground (as you observed in #71), and I was very reluctant to adjust it for fear of breaking it more.
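To give a sense of the position-mapping problem, here is a small sketch, not taken from Obscenity's source, that strips non-alphabetic characters while recording where each surviving character sat in the original string. The names `Span` and `findInStripped` are hypothetical. Iteration is by code point, while the recorded offsets are UTF-16 code units so that they can be used with `String.prototype.slice`.

```typescript
// UTF-16 code unit offsets into the original string; `end` is exclusive.
interface Span {
  start: number;
  end: number;
}

// Hypothetical helper: finds `needle` in the stripped, lowercased text and
// maps the match back to a span in `original`.
function findInStripped(original: string, needle: string): Span | undefined {
  const kept: string[] = [];    // code points that survive stripping
  const offsets: number[] = []; // UTF-16 offset in `original` of each kept code point
  let offset = 0;
  for (const cp of Array.from(original)) { // Array.from iterates by code point
    if (/^[a-zA-Z]$/.test(cp)) {
      kept.push(cp.toLowerCase());
      offsets.push(offset);
    }
    offset += cp.length; // 1 code unit for BMP characters, 2 for astral ones
  }

  const target = Array.from(needle);
  if (target.length === 0) return undefined;
  for (let i = 0; i + target.length <= kept.length; i++) {
    let matched = true;
    for (let j = 0; j < target.length; j++) {
      if (kept[i + j] !== target[j]) {
        matched = false;
        break;
      }
    }
    if (matched) {
      const last = i + target.length - 1;
      // Kept characters are ASCII letters, so each occupies one code unit.
      return { start: offsets[i], end: offsets[last] + 1 };
    }
  }
  return undefined;
}

const input = "😀 f*ck!";
const span = findInStripped(input, "fck");
console.log(span); // { start: 3, end: 7 }
console.log(span && input.slice(span.start, span.end)); // "f*ck"
```

The emoji occupies two UTF-16 code units but one code point, which is why the match starts at offset 3 rather than 2; mixing the two units of measure is an easy way to get such spans wrong.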
Recently, however, after your previous report I took the time to rework some of the code and fuzz tested it in https://github.com/jo3-l/obscenity/commit/25bd1db98591f4be75a9787aafcec4d850d0ddc0 and am now much more confident that things are as they should be. Addressing #46 should, I think, be considerably more straightforward after this, and it's possible we can do it in the next release.
I did read #46, but did not necessarily tie it to the use case presented here. I'll add asterisks to my word inputs, since that will work for the moment, as per your suggestion.
Thanks a lot for your hard work!!
I did read https://github.com/jo3-l/obscenity/issues/46, but not necessarily tied it to the use case presented here.
That's fair. The title of #46 is a little misleading at the moment; it's more of a tracking issue to get the skipNonAlphabetic transformer re-enabled by default since the original problem there was fixed.
For ease of tracking, I'm going to close this in favor of #46, which I just renamed to better reflect the current state of that issue. As discussed above, the suggestion ultimately presented there is a directly actionable way of solving the same problem in your original issue. Thanks!
Description
I've tried a few combinations of EnglishTransformers, but I haven't been able to correctly censor words like sh*t or f*ck. In both cases the words should be censored; however, in the first word * represents an i, and in the second * represents a u. Is there a way to create a new transformer for multiple letters/regex?
Solution
I do not know how this can be implemented. Looking at the L33tspeak transformer, I can see there's a map per character:
However, I don't know how it would work for multiple characters where for example, we could have