scramblingbalam opened this issue 7 years ago
The code will output scores in three files: “output/A2.uni.txt”, “output/A2.bi.txt”, and “output/A2.tri.txt”. Each file simply lists the log-probability of every sentence under the corresponding model. Here are the first few lines of each file:

A2.uni.txt:
-178.726835483
-259.85864432
-143.33042989

A2.bi.txt:
-92.1039984276
-132.096626407
-90.185910842

A2.tri.txt:
-26.1800453413
-59.8531008074
-42.839244895
Now, run our perplexity script, “perplexity.py”, on each of these files. The script counts the words of the corpus and uses the log-probabilities you computed to calculate the total perplexity of the corpus. To run the script, the command is:

python perplexity.py
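As a rough illustration of what such a script computes, here is a minimal sketch of corpus perplexity from per-sentence log-probabilities. The function name and the log base are assumptions (the base used by the assignment's scores is not stated here; base 2 is a common choice), so this is not the actual perplexity.py:

```python
# Sketch: perplexity from per-sentence log-probabilities.
# Assumes all log-probs share `log_base`; the assignment's base
# may differ, so treat this as illustrative only.
def corpus_perplexity(sentence_logprobs, word_count, log_base=2):
    total = sum(sentence_logprobs)  # log-prob of the whole corpus
    # Perplexity = base ** (-total log-prob / number of words)
    return log_base ** (-total / word_count)

# Hypothetical example: two sentences totalling 10 words.
print(corpus_perplexity([-12.0, -8.0], 10))  # 2 ** (20/10) = 4.0
```

Lower perplexity means the model assigns higher probability to the corpus, which is why the trigram scores above (closer to zero) correspond to a better fit on the training data.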
Use your models to find the log-probability, or score, of each sentence in the Brown training data with each n-gram model. This corresponds to implementing the score() function. Make sure to handle the possibility that a sentence contains an n-gram that does not exist in the training corpus. This cannot happen now, because we are computing the log-probabilities of the training sentences themselves, but it will be necessary for question 5. The rule we are going to use is: if you find any n-gram that was not in the training sentences, set the whole sentence's log-probability to -1000 (use the constant MINUS_INFINITY_SENTENCE_LOG_PROB).
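The scoring rule above can be sketched as follows. The names (`ngram_logprobs`, the function signature) are illustrative assumptions, not the assignment's actual API; the point is only the sum-of-log-probs loop and the -1000 fallback:

```python
MINUS_INFINITY_SENTENCE_LOG_PROB = -1000

def score(ngram_logprobs, n, tokens):
    """Sum the stored log-probabilities of every n-gram in a sentence.

    `ngram_logprobs` is assumed to map n-gram tuples to log-probabilities
    learned from the training corpus. Per the assignment's rule, any
    n-gram missing from training sends the whole sentence to -1000.
    """
    total = 0.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram not in ngram_logprobs:
            # Unseen n-gram: the entire sentence gets the floor score.
            return MINUS_INFINITY_SENTENCE_LOG_PROB
        total += ngram_logprobs[ngram]
    return total

# Hypothetical bigram table for a toy example.
probs = {("the", "cat"): -1.0, ("cat", "sat"): -2.0}
print(score(probs, 2, ["the", "cat", "sat"]))  # -3.0
print(score(probs, 2, ["the", "dog", "sat"]))  # -1000 (unseen bigram)
```

Note that because log-probabilities multiply as sums, the sentence score is just the sum over its n-grams; the real score() would also handle sentence-boundary markers, which are omitted here.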