shane-mason / essential-generators


Controversial words and troublesome pairs in generated text #5

Closed TodorovicSrdjan closed 6 months ago

TodorovicSrdjan commented 6 months ago

Is there a way to avoid generation of text which contains controversial words like "ISIS"?

I have used this package for some fake data generation, and I'm worried that, depending on the use case, the generated text might get someone put on a blacklist or watchlist because of those words.

In my case, there were around 10 instances where the text mentioned the word "ISIS".

Some examples:

"Join isis" and "isis support" also seem like common pairs, so it looks like a real issue which could get someone into trouble.

Here is a code snippet that generates those sentences:

from essential_generators import DocumentGenerator

gen = DocumentGenerator()
for _ in range(10):
    random_text = gen.paragraph(1, 3)
    print(random_text)

shane-mason commented 6 months ago

These are good points - the model it pulls from was generated using some random articles from Wikipedia, so there can be all sorts of stuff in there. You can build a new word model from a corpus of only 'safe words' - though the documentation I've currently provided is pretty thin:

Essential Generator ships with text and word models built from various Wikipedia articles. There are three scripts included to help you generate new models:

  • build_corpus.py - Retrieves specified articles from Wikipedia to use when training the models. Default output is 'corpus.txt'.
  • build_text_model.py - Uses corpus.txt to output markov_textgen.json as the text model for sentences and paragraphs.
  • build_word_model.py - Uses corpus.txt to output markov_wordgen.json as the word model (for words, emails, domains, etc.).

Those files are in the 'test' directory. You don't have to use Wikipedia articles from 'build_corpus.py' - any flat text file full of words will do. If you take this route and have any problems, let me know, and I'll try to help.
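
If you already have a flat text corpus and only want to strip out a few troublesome words before training, a small pre-filter may be enough. The following is a minimal sketch, not part of essential-generators: `filter_corpus` is a hypothetical helper, and it assumes the corpus is a UTF-8 text file where dropping a whole line (which may be a whole paragraph) is acceptable.

```python
import re


def filter_corpus(src, dst, banned_words):
    """Copy src to dst, dropping every line that contains a banned word.

    Hypothetical helper, not part of essential-generators.
    Matching is case-insensitive and on whole words only, so banning
    "isis" does not drop a line that merely contains "crisis".
    Returns the number of lines dropped.
    """
    pattern = None
    if banned_words:
        # Build one alternation such as \b(?:isis|foo)\b
        alternation = "|".join(re.escape(word) for word in banned_words)
        pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    dropped = 0
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if pattern is not None and pattern.search(line):
                dropped += 1  # skip the whole line
            else:
                fout.write(line)
    return dropped
```

Calling `filter_corpus("raw_corpus.txt", "corpus.txt", ["isis"])` would write a cleaned `corpus.txt`, which the build scripts described above should then pick up (the input file name here is only an example).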

Here is what I will do to try and improve this:

I'll update the model to have safer text. I recently made a video about a similar generator I wrote in Perl. I used a large corpus of public-domain fiction. I'll use that corpus to train a new model.

I'll add a 'word-group filter' that will let you specify a list of words to be filtered out.

I'll try to get those out very soon.

shane-mason commented 6 months ago

Okay, I just committed a 'banned_words' property on DocumentGenerator. This prevents banned words from showing up in output produced by the Markov text generator, but it will NOT stop them from being randomly generated by the statistic and word gen models (though it is unlikely those would generate them at random). I will look at extending this further in the future. This should address your main issue. Here is how you use it:

from essential_generators import DocumentGenerator
gen = DocumentGenerator(banned_words=["ISIS", "isis"])
paragraph = gen.paragraph()
print(paragraph)

Currently, it looks for exact matches to skip, so you'll need to provide all the permutations of capitalization you want to ban.
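
Since matching is exact, one way to avoid typing out every capitalization by hand is to generate the variants programmatically. This is only a sketch under that assumption: `case_variants` and `expand_banned` are hypothetical helpers, not part of the package, and the number of variants doubles with each letter, so it is best suited to short words.

```python
from itertools import product


def case_variants(word):
    """Return every upper/lower-case permutation of word, sorted.

    Hypothetical helper, not part of essential-generators.
    Characters without a case distinction (digits, punctuation)
    are kept as-is.
    """
    choices = [
        (ch.lower(), ch.upper()) if ch.lower() != ch.upper() else (ch,)
        for ch in word
    ]
    # A set removes duplicates before sorting
    return sorted({"".join(combo) for combo in product(*choices)})


def expand_banned(words):
    """Flatten the case variants of several words into one list."""
    expanded = []
    for word in words:
        expanded.extend(case_variants(word))
    return expanded
```

The result can then be passed to the property shown above, for example `DocumentGenerator(banned_words=expand_banned(["isis"]))`, which covers "isis", "Isis", "ISIS", and the 13 other mixed-case spellings.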