Closed tayciryahmed closed 6 years ago
For your language model, the more data you use, the better the estimations are going to be. I assume by lines you mean sentences? 2.5M is very small, and much larger monolingual datasets are available that you can use.
For NMT experiments, in the preprocessing step we usually filter out words with very small frequencies. That decision is up to you, and it depends on the data that you have and the type of problems that you want to solve.
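The filtering step described above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing script; the function name and the `min_freq` threshold are placeholders you would tune for your own corpus.

```python
from collections import Counter

def filter_rare_words(sentences, min_freq=2):
    """Count whitespace-separated tokens and drop words whose
    frequency falls below min_freq.

    min_freq is an illustrative threshold; choose it based on
    corpus size and the task at hand.
    """
    counts = Counter(word for s in sentences for word in s.split())
    return {w: c for w, c in counts.items() if c >= min_freq}

corpus = ["the cat sat", "the dog sat", "a cat ran"]
print(filter_rare_words(corpus))  # words seen at least twice
```

In practice the rare words that are filtered out are typically mapped to an `<unk>` token rather than discarded entirely, so sentence lengths are preserved.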
The reason the substitutions are empty is essentially that, with those parameters and models, the script could not come up with appropriate substitutions. Either the hyperparameters are set too high for the models you have, or, since your language model was trained on a small dataset, I suspect the prediction probabilities fall below the default hyperparameter thresholds.
Hi,
I have been trying to reproduce your paper on French-English translation and have some questions:
vocab_freq file: I generate the list of word frequencies in my source file, sort them in increasing order, and take the first 59892 lines. I get words with frequencies from 1 to 11, but most words appear only once. Is this normal? I see in your README that the frequencies are about 3-2000. Toy example:
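For comparison, one common way to build such a file is sketched below. Note that it keeps the *most* frequent words by sorting in decreasing order before truncating; given that the README's frequencies range up to the thousands, that may be the intended direction, but this is a guess from the numbers, not the repository's own script. The function name and line format are illustrative.

```python
from collections import Counter

def build_vocab_freq(path, top_n=59892):
    """Count whitespace-separated tokens in a text file and
    return the top_n most frequent (word, count) pairs,
    highest frequency first."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    # most_common sorts in decreasing order of frequency, so
    # truncation drops the rare tail rather than the frequent head.
    return counts.most_common(top_n)
```

If you instead sort in increasing order and take the first N lines, you keep only the rarest words, which would explain seeing mostly frequency-1 entries.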
Thanks.