scramblingbalam / F2016_EECS595_NLP

Programming assignments for Natural Language Processing

Log probability of sentences #9

Open scramblingbalam opened 7 years ago

scramblingbalam commented 7 years ago

Use your models to find the log-probability, or score, of each sentence in the Brown training data with each n-gram model. This corresponds to implementing the score() function. Make sure to handle the possibility that a sentence contains an n-gram that doesn't exist in the training corpus. This will not happen now, because we are computing the log-probabilities of the training sentences, but it will be necessary for question 5. The rule we are going to use is: if you find any n-gram that was not in the training sentences, set the whole sentence log-probability to -1000 (use the constant MINUS_INFINITY_SENTENCE_LOG_PROB).
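Here is a minimal sketch of what score() could look like, assuming the model is stored as a dict mapping n-gram tuples to log-probabilities and that each sentence is already tokenized and padded with start/stop symbols (the parameter names and data layout are assumptions, not the starter code's actual interface):

```python
MINUS_INFINITY_SENTENCE_LOG_PROB = -1000

def score(ngram_log_probs, n, sentences):
    """Return one log-probability per sentence under an n-gram model.

    ngram_log_probs: dict mapping n-gram tuples to log-probabilities
    n: order of the model (1, 2, or 3)
    sentences: list of token lists, already padded with start/stop symbols
    """
    scores = []
    for tokens in sentences:
        total = 0.0
        for i in range(n - 1, len(tokens)):
            ngram = tuple(tokens[i - n + 1:i + 1])
            if ngram not in ngram_log_probs:
                # Unseen n-gram: the whole sentence gets -1000.
                total = MINUS_INFINITY_SENTENCE_LOG_PROB
                break
            total += ngram_log_probs[ngram]
        scores.append(total)
    return scores
```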

scramblingbalam commented 7 years ago

The code will output scores in three files: “output/A2.uni.txt”, “output/A2.bi.txt”, “output/A2.tri.txt”. These files simply list the log-probabilities of each sentence for each different model. Here’s what the first few lines of each file look like:

A2.uni.txt
-178.726835483
-259.85864432
-143.33042989

A2.bi.txt
-92.1039984276
-132.096626407
-90.185910842

A2.tri.txt
-26.1800453413
-59.8531008074
-42.839244895
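A small sketch of how those files might be written, assuming score() returns one log-probability per sentence (the helper name here is illustrative, not necessarily the one in the starter code):

```python
def write_scores(scores, output_path):
    """Write one sentence log-probability per line, matching the A2 output format."""
    with open(output_path, 'w') as f:
        for s in scores:
            f.write(str(s) + '\n')

# e.g. write_scores(uni_scores, 'output/A2.uni.txt')
```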

scramblingbalam commented 7 years ago

Now, you need to run our perplexity script, “perplexity.py”, on each of these files. This script will count the words of the corpus and use the log-probabilities you computed to calculate the total perplexity of the corpus. To run the script, the command is:

python perplexity.py <output_file> <corpus_file>

where <output_file> is one of the A2 output files and <corpus_file> is “data/Brown_train.txt”. Include the perplexity of the corpus for the three different models in your README. Here’s what our script printed when <output_file> was “A2.uni.txt”:

python perplexity.py output/A2.uni.txt data/Brown_train.txt
The perplexity is 1052.4865859
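For reference, a sketch of the calculation perplexity.py performs, under two assumptions that may differ from the provided script: the scores are base-2 log-probabilities (one sentence per line), and the word count includes one stop symbol per sentence:

```python
import math

def corpus_perplexity(scores_path, corpus_path):
    """Recompute corpus perplexity from per-sentence log-probabilities."""
    with open(scores_path) as f:
        total_log_prob = sum(float(line) for line in f)
    with open(corpus_path) as f:
        # +1 per line to count the stop symbol (an assumption about the script).
        word_count = sum(len(line.split()) + 1 for line in f)
    # Perplexity = 2 ^ (-average log2-probability per word)
    return math.pow(2, -total_log_prob / word_count)

print(corpus_perplexity('output/A2.uni.txt', 'data/Brown_train.txt'))
```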