The swears.txt file is English-only, but I like watching foreign-language films, so a multilingual dataset of bad words would be great. LDNOOBW by Shutterstock is probably the best dataset I've come across that could do the job.
Since we already have a --lang flag, perhaps we could extend it to select which language(s) to search for bad words? The one problem is a video that contains more than one language. Thoughts?
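To make the idea concrete, here is a rough sketch of how it might work. It assumes --lang could accept a comma-separated list of language codes and that the word lists sit in a directory with one plain-text file per language code, one word per line (which is how I understand LDNOOBW to be laid out). The names `parse_langs`, `load_bad_words` and `build_parser` are made up for illustration and are not taken from the existing code.

```python
import argparse
from pathlib import Path


def parse_langs(value: str) -> list[str]:
    """Split a comma-separated --lang value such as 'en,fr' into codes."""
    langs = [code.strip().lower() for code in value.split(",") if code.strip()]
    if not langs:
        raise argparse.ArgumentTypeError("expected at least one language code")
    return langs


def load_bad_words(word_dir: Path, langs: list[str]) -> set[str]:
    """Merge the word lists of every requested language into one set."""
    words: set[str] = set()
    for code in langs:
        # Assumed layout: one file per language code, one entry per line.
        path = word_dir / code
        if not path.is_file():
            raise FileNotFoundError(f"no word list for language {code!r}")
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                words.add(line.strip().lower())
    return words


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real --lang flag may mean something else today."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lang", type=parse_langs, default=["en"])
    return parser
```

With something like this, a film with both French and English dialogue could be handled by passing `--lang en,fr`, which simply searches against the union of both lists. That does not detect which language a given line is in, so a word that is harmless in one language but listed in another would still be matched.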