lk-geimfari / mimesis

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
https://mimesis.name
MIT License
4.34k stars, 326 forks

Remove inappropriate words from your random text selections #1511

Closed: eklicious closed this issue 3 months ago

eklicious commented 3 months ago

Feature request

Go through all your data sets and remove inappropriate words.

Thesis

For example, text.json contains words like 'milf' and 'milfhunter'. These need to be removed because customers end up seeing them in their sample data sets, and that doesn't make anyone look good.

Reasoning

If you want companies using your tool, you need to cleanse the data.

lk-geimfari commented 3 months ago

I completely agree with this. The problem is that this data was collected from all over the internet, and not by me alone, so obviously I haven't seen and verified all of it. It's also worth noting that this kind of data got there by accident.

I take this problem seriously. Fixing it will be a top priority for the next release.

lk-geimfari commented 3 months ago

Well, I removed everything I found using: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master

I hope this improves the quality of the datasets and that there won't be bad words in them, but I can't guarantee it, because I can't check all the datasets word by word. That's physically impossible for me.
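
For readers who want to run a similar cleanup on their own data, here is a minimal sketch of filtering a JSON dataset against a blocklist. It is not the script used for this fix. The helper names `load_blocklist` and `scrub`, the sample words, and the dataset layout are all hypothetical, and it assumes the blocklist is a plain-text file with one entry per line.

```python
import json
from typing import Any


def load_blocklist(text: str) -> set[str]:
    """Parse a newline-separated word list into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def scrub(data: Any, blocklist: set[str]) -> Any:
    """Recursively drop list entries that exactly match a blocklisted word."""
    if isinstance(data, dict):
        return {key: scrub(value, blocklist) for key, value in data.items()}
    if isinstance(data, list):
        return [
            scrub(item, blocklist)
            for item in data
            if not (isinstance(item, str) and item.strip().lower() in blocklist)
        ]
    # Scalars (strings outside lists, numbers, etc.) are left untouched.
    return data


# Hypothetical stand-ins for the contents of a blocklist file and a dataset file.
blocklist = load_blocklist("badword\nworseword\n")
dataset = json.loads(
    '{"words": {"normal": ["apple", "BadWord", "pear"]}, "level": ["low", "worseword"]}'
)
cleaned = scrub(dataset, blocklist)
print(json.dumps(cleaned))
```

This sketch only removes exact, case-insensitive matches. A compound such as 'milfhunter' is caught only if it appears in the blocklist itself. Substring matching would catch more, but it also risks deleting harmless words that happen to contain a listed term, so a manual review of the hits is still advisable.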

lk-geimfari commented 3 months ago

Version 16.0.0 with fixes is available now.