These tasks generate text that mimics genuine English by using trigrams to model English text patterns. The project demonstrates how to use Python on a few Project Gutenberg texts to create, analyse, and export a trigram-based generator.
A trigram is a sequence of three characters in text, such as "THE" or "AND". Counting how often each trigram occurs gives a simple model of patterns in the English language. Beginning with a seed, we can then produce fresh text that "sounds" like English by using these counts to generate text character by character.
Task 1 - Building the trigram model: Loads English texts from Project Gutenberg, cleans them, and counts each trigram (three-character sequence). The resulting model captures common patterns in the language.
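The counting step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_text` and `count_trigrams` are hypothetical, and the cleaning rule (uppercase, keep only letters, spaces, and full stops) is an assumption about what "cleaning" means here. It also does not strip the Project Gutenberg header and footer.

```python
import re
from collections import Counter


def clean_text(raw: str) -> str:
    """Uppercase the text and keep only A-Z, spaces, and full stops.

    The retained character set is an assumption for this sketch.
    """
    upper = raw.upper()
    # Replace every other character (including newlines) with a space.
    kept = re.sub(r"[^A-Z. ]", " ", upper)
    # Collapse runs of spaces so padding does not dominate the counts.
    return re.sub(r" +", " ", kept).strip()


def count_trigrams(text: str) -> Counter:
    """Count every overlapping three-character sequence in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


sample = "The cat and the hat."
cleaned = clean_text(sample)
model = count_trigrams(cleaned)
print(cleaned)
print(model["THE"])
```

A `Counter` is used because it maps each trigram directly to its frequency, which is also the shape the later tasks need.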
Task 2 - Generating text: Using the trigram model, generates a 10,000-character string that resembles English text. Starting with "TH", it picks each next character based on probabilities from the model, building text that reflects the style of the original works.
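One way to implement this step is weighted random sampling over the trigrams that start with the last two generated characters. The sketch below assumes the model is a mapping from trigram to count; `next_char`, `generate`, and `toy_model` are illustrative names, and the toy counts are made up purely so the example runs on its own.

```python
import random
from collections import Counter


def next_char(model, last_two, rng):
    """Pick the next character, weighted by how often it follows last_two."""
    candidates = {tri[2]: n for tri, n in model.items() if tri[:2] == last_two}
    if not candidates:
        return None  # no trigram starts with this pair
    chars = list(candidates)
    weights = [candidates[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def generate(model, seed="TH", length=10_000, rng=None):
    """Extend the seed one character at a time until length is reached."""
    rng = rng or random.Random()
    out = list(seed)
    while len(out) < length:
        ch = next_char(model, "".join(out[-2:]), rng)
        if ch is None:
            break  # dead end: stop early rather than guess
        out.append(ch)
    return "".join(out)


# Toy counts for illustration only; a real model comes from the source texts.
toy_model = Counter({
    "THE": 5, "HE ": 4, "E T": 3, " TH": 3,
    "THA": 1, "HAT": 1, "AT ": 1, "T T": 1,
})
text = generate(toy_model, length=40, rng=random.Random(0))
print(text)
```

Scanning the whole model on every step keeps the sketch short; for a full 10,000-character run it would be faster to group the counts by their two-character prefix once, up front.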
Task 3 - Analysing the generated text: To see how "English-like" the generated text is, compares it against a list of valid English words (words.txt). The percentage of valid words in the generated text measures how closely the model replicates real English.
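The percentage calculation might look like the sketch below. It assumes words.txt holds one word per line and that words are compared in uppercase with full stops stripped; both are assumptions, as are the names `load_vocabulary` and `valid_word_percentage`. A small in-memory set stands in for the real word list so the example is self-contained.

```python
def load_vocabulary(path):
    """Read a word list into a set, assuming one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().upper() for line in f if line.strip()}


def valid_word_percentage(text, vocabulary):
    """Return the percentage of space-separated words found in vocabulary."""
    words = [w.strip(".") for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only full stops
    if not words:
        return 0.0
    valid = sum(1 for w in words if w in vocabulary)
    return 100.0 * valid / len(words)


# Stand-in for the set returned by load_vocabulary("words.txt").
vocabulary = {"THE", "CAT", "AND", "HAT"}
generated = "THE CAT AND THET HAT."
score = valid_word_percentage(generated, vocabulary)
print(f"{score:.1f}% valid words")
```

Using a set for the vocabulary makes each membership check constant time, which matters when the generated text contains thousands of words.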
Task 4 - Exporting the model as JSON: Exports the trigram model as a JSON file (trigrams.json) so that it can be reused later in the project.
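A possible shape for the export, together with the matching load step, is sketched below. The helper names `export_model` and `load_model` are hypothetical, and the example writes to a temporary directory rather than the project's own data folder so it can be run safely anywhere.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def export_model(model, path):
    """Write the trigram counts to a JSON file as a flat object."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(dict(model), f, indent=2, sort_keys=True)


def load_model(path):
    """Read the counts back into a Counter."""
    with Path(path).open(encoding="utf-8") as f:
        return Counter(json.load(f))


model = Counter({"THE": 2, "HE ": 2, "E C": 1})
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "trigrams.json"
    export_model(model, target)
    restored = load_model(target)

print(restored == model)
```

Because every trigram is a string and every count is an integer, the model maps onto a JSON object without any custom encoding.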
Clone the repository: in a terminal, run the command
Install any necessary libraries (if they are not already installed): run the command
Open trigrams.ipynb in Jupyter Notebook to run each cell and follow the steps.
Each task in the Jupyter Notebook has a clear explanation and accompanying code blocks. Open trigrams.ipynb and execute each cell in order. You will be able to see:
After completing Task 4, you will have a trigrams.json file in the /Data directory. This file contains the trigram model, saved in JSON format, with each trigram and its frequency count.
Here is a short example of what the generated text might look like (the output differs slightly on each run):
"TH IS A SA MP LE OF TE XT TH AT LO OKS LI KE EN GL ISH BUT MI GH T NO T MA KE S EN SE"
With the source texts in the /Data directory, the generated text will likely contain between 30% and 50% valid English words.
These tasks show that a trigram-based model can approximate language patterns and produce text with English-like structure. The method is very straightforward and can be reused for further text generation and other natural language processing work.