Untrained on different formats of profanity.

vzhou842 / profanity-check

A fast, robust Python library to check for offensive language in strings.

https://pypi.org/project/profanity-check

MIT License

612 stars 113 forks

Untrained on different formats of profanity. #2

Closed. KevinTyrrell closed this issue 5 years ago

KevinTyrrell commented 5 years ago

Most word lists are not good enough to stop profanity because people can format the words differently from what the list expects.

The v1.0.2 release (https://github.com/vzhou842/profanity-check/releases/tag/v1.0.2) fails to catch this.

e.g.

!@#$%^ <-- 6 letter offensive word, caught by profanity-check

! @ # $ % ^ <-- same word, spaced, 12.1% certainty of profanity.
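
To make the spaced-out case concrete, here is a minimal sketch of a preprocessing step that could run before a string is scored. The helper `collapse_spaced_letters` is hypothetical and not part of profanity-check. It assumes the only reformatting is single spaces between single characters, and it will also join harmless runs such as "a b c".

```python
import re

# Hypothetical helper, not part of profanity-check.
# Matches a run of three or more single non-space characters separated by
# single spaces, bounded by whitespace or the string edges.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){2,}\S(?!\S)")


def collapse_spaced_letters(text: str) -> str:
    """Join runs like 'b a d' into 'bad'; leave normal words untouched."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("! @ # $ % ^"))
    print(collapse_spaced_letters("this is a test"))
    # With profanity-check installed, the normalized string could then be
    # scored with its predict_prob function, which takes a list of strings:
    #   from profanity_check import predict_prob
    #   predict_prob([collapse_spaced_letters("! @ # $ % ^")])
```

This only covers the single-space variant shown above; other reformats (inserted punctuation, character substitutions) would need their own handling.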

vzhou842 commented 5 years ago

Hey, you're absolutely right! There's a lot that profanity-check won't catch because clever reformats of profanity won't appear in training datasets :(. This isn't what profanity-check is designed for, though - its main focus is being smarter and more robust than traditional wordlists while also being more performant and having a lower footprint than more complex ML solutions.