To catch a commonly used evasion technique (splitting a word with spaces or symbols), the filter could merge the preceding word with the current one after removing symbols. For example:
"this is a test" would check for "thisis", "isa" and "atest", as well as the individual words.
This will cause some false positives, the main one being "puta", for example in "I put a block here". In tests using str.find on messages with the spaces removed, "puta" was found 1081 times out of the 4890 blacklist hits, more than "fuck", which was found 1024 times.
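A minimal sketch of the merging idea, to make the example above concrete. The function name `candidates` is hypothetical and not part of filterplus, and stripping symbols with `%W` is an assumption (it is ASCII-only, so non-ASCII letters would be stripped too):

```lua
-- Hypothetical helper (not part of filterplus): builds the list of strings
-- to check against the blacklist, i.e. every word plus every word merged
-- with the word before it.
local function candidates(message)
    local words = {}
    for word in message:lower():gmatch("%S+") do
        -- Assumption: "symbols" means anything that is not an ASCII letter or digit.
        word = word:gsub("%W", "")
        if word ~= "" then
            words[#words + 1] = word
        end
    end

    local out = {}
    for i, word in ipairs(words) do
        out[#out + 1] = word
        if i > 1 then
            out[#out + 1] = words[i - 1] .. word
        end
    end
    return out
end

print(table.concat(candidates("this is a test"), ","))
-- this,is,thisis,a,isa,test,atest
print(table.concat(candidates("I put a block here"), ","))
-- i,put,iput,a,puta,block,ablock,here,blockhere
```

The second call shows the "puta" false positive: it only appears because "put" and "a" were merged.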
This is only worth doing if the added checks aren't too costly. For reference, here is the most recent profiler dump:
2024-02-11 07:29:17: ACTION[Server]: Values below show absolute/relative times spend per server step by the instrumented function.
A total of 2306385 samples were taken
instrumentation | min µs | max µs | avg µs | min % | max % | avg %
-------------------------- | -------- | -------- | -------- | ----- | ----- | ------
filterplus: | 0 | 72973 | 16 | 0.0 | 92.3 | 0.3
- on_leaveplayer[1] .... | 1 | 788 | 4 | 0.0 | 19.1 | 0.1
- on_chat_message[1] ... | 5 | 72973 | 648 | 0.0 | 92.3 | 13.0
- /mute ................ | 289 | 361 | 332 | 1.2 | 18.9 | 10.5
- on_joinplayer[1] ..... | 1 | 2779 | 4 | 0.0 | 8.1 | 0.0
Another string manipulation that is needed is removing repeated characters. No English word (that I know of) uses the same letter more than twice in a row, so removing the additional repeats would be handy.
A quick write-up which mostly works:
    local function remove_repeating(message)
        return message:gsub("([%S]+)([%S])%1", "%2") -- keep doubled chars
    end

    print(
        remove_repeating("hi assume this teeeeeesssssttttt aggregate okkkk")
    )
The bug with the above script: for odd-numbered repeats (e.g. 55555) it leaves one character, and for even-numbered repeats (e.g. 4444) it leaves two characters. As a result, many word variants would have to be added to the blacklist, e.g. one, oone, oonnee, onnee, etc. This is not efficient, but it's a starting point anyway.
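One possible way around the odd/even problem, offered as a sketch rather than a tested fix for filterplus: match exactly three identical characters and replace them with two, repeating until nothing changes. Lua patterns do not allow a quantifier on a back-reference such as `%1`, which is why this version loops instead of using a single pattern. It also only matches runs of the same character, whereas the pattern above can match sequences like "ana" as well.

```lua
-- Sketch of an alternative: collapse any run of 3 or more identical
-- non-space characters down to exactly 2.
local function remove_repeating(message)
    local count
    repeat
        -- "(%S)%1%1" matches three identical characters in a row;
        -- each pass shortens every such triple by one character.
        message, count = message:gsub("(%S)%1%1", "%1%1")
    until count == 0
    return message -- only the string, not gsub's match count
end

print(remove_repeating("hi assume this teeeeeesssssttttt aggregate okkkk"))
-- hi assume this teesstt aggregate okk
print(remove_repeating("55555"), remove_repeating("4444"))
-- 55	44
```

With this, the blacklist would only need the doubled and undoubled forms of each letter (e.g. one, oone, onne), though that is still several variants per word.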