Missing Training Script?

vzhou842 / profanity-check

A fast, robust Python library to check for offensive language in strings.

https://pypi.org/project/profanity-check

MIT License

612 stars 113 forks

Missing Training Script? #3

Open cathalgarvey opened 5 years ago

cathalgarvey commented 5 years ago

Hey, I read your blog post about profanity-check, so I've seen the code there, but I'm wondering whether you have a separate file for training? And/or one for validation or "benchmarking"?

If so, I'd love to see those in the repo. :)

vzhou842 commented 5 years ago

Hey, thanks for the comment. I do have all of that code but unfortunately it's a bit scattered and not really in good shape to be uploaded to the repo. If anyone else is interested in seeing this, please comment on this issue! I'll clean up my code and upload it if a few people want to see it.

cathalgarvey commented 5 years ago

Welp, it's something I'd be interested in playing with, potentially contributing towards, if you do ever get around to sharing it. :)

alexandrduduka commented 5 years ago

@vzhou842, thank you for your awesome work, it's really admirable! I would be interested in seeing the code as well. Is the piece of code shown in the article enough to retrain the model? I want to feed it more data and change the requirements a bit (I need to check not only for profanity, but for some other things too).

I would also like to ask: the profanity-filter library claims to use deep analysis to catch misspelled words. Do you think it is possible to apply that approach on top of your library to improve precision at the cost of speed? As far as I understand, it shouldn't be possible, because you do not define any "black list" directly, so there is nothing to convert. But maybe there is some other way that I do not see? Having a larger dataset with popular misspellings doesn't seem to fully resolve the issue, as there are just too many ways to misspell each word.

Sorry if my questions are profane, I am just getting started with machine learning :-)

vzhou842 commented 5 years ago

@alexandrduduka this library is based on scikit-learn's LinearSVC class, so I'd recommend playing with that if you want to reproduce something similar.
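
For anyone who wants a starting point, here is a minimal sketch of training a text classifier with scikit-learn's LinearSVC. It is not the library's actual training script, which is not in the repo. The toy sentences, the labels, and the choice of CountVectorizer for feature extraction are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, hand-made toy data (1 = offensive, 0 = clean).
# A real model needs a large labeled dataset instead.
texts = [
    "you are a stupid idiot",
    "shut up you idiot",
    "what a stupid thing to say",
    "you are a wonderful person",
    "have a great day",
    "thanks for the help",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts feed a linear SVM; the vectorizer choice is an assumption.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

# Classify new strings; each prediction is one of the training labels (0 or 1).
predictions = model.predict(["you idiot", "have a wonderful day"])
print(predictions)
```

Swapping in your own `texts` and `labels` is enough to retrain on different data or on different categories than profanity.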

As for improving precision, there are lots of ways to do that (all of which would come at the cost of speed). That's too big of a question for me to answer concisely, but basically you'd have to use more complex / powerful models and possibly better / more data preprocessing.
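
As one illustration of different preprocessing, the sketch below swaps word features for character n-grams, so that a misspelled word still shares some features with its correct spelling, and then measures precision and recall on a held-out set. This is an assumption about one possible approach, not something suggested in this thread or used by the library, and the toy data is far too small to say whether it helps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data (1 = offensive, 0 = clean), for illustration only.
train_texts = [
    "you are a stupid idiot",
    "shut up you idiot",
    "what a stupid thing to say",
    "you are a wonderful person",
    "have a great day",
    "thanks for the help",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Held-out examples, including one deliberately misspelled word.
test_texts = ["you idi0t", "great help, thanks"]
test_labels = [1, 0]

# Character n-grams within word boundaries instead of whole-word tokens.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)

# Precision: share of flagged strings that are truly offensive.
# Recall: share of offensive strings that were flagged.
precision = precision_score(test_labels, predicted, zero_division=0)
recall = recall_score(test_labels, predicted, zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Character n-grams produce many more features than word counts, which fits the speed trade-off mentioned above.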

adarsa commented 5 years ago

@vzhou842 Thank you for the model. I would like to see the training and benchmarking scripts behind the results you have presented. Looking forward to being able to contribute to and extend this.