marziehf / DataAugmentationNMT

Data Augmentation for Neural Machine Translation
MIT License

No substitutions generated for FR - EN translation #1

Closed tayciryahmed closed 6 years ago

tayciryahmed commented 6 years ago

Hi,

I have been trying to reproduce your paper on French - English translation and have some questions:

  1. How many lines do you use to train the language model? (I used 2.5M lines in French.)
  2. Regarding the vocab_freq file: I generate the list of word frequencies in my source file, sort it in increasing order of frequency, and take the first 59892 lines. I get words with frequencies from 1 to 11, but most words have a frequency of 1. Is this normal? I see in your README that the frequencies are about 3-2000:

    ...
    change 3028
    taken 3007
    large 2999
    again 2994
    ...

  3. When I run the substitution script I get no suggestions (all sets are empty). I don't see what went wrong or why no substitution suggestions are generated.

Toy example:

Input: I enjoy it
Output: I {} enjoy{} it {}

Thanks.

marziehf commented 6 years ago

For your language model, the more data you use, the better the estimates will be. I assume by lines you mean sentences? 2.5M is very small, and much larger monolingual corpora are available that you can use.

For NMT experiments, in the preprocessing step we usually filter out words with very low frequencies. That decision is yours; it depends on the data you have and the type of problem you want to solve.
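As a rough illustration of frequency-based filtering, here is a minimal Python sketch. It assumes whitespace-tokenized text; the function names `count_words` and `select_by_frequency`, the toy corpus, and the frequency bounds are all hypothetical and are not taken from the repository's scripts.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def select_by_frequency(counts, min_freq, max_freq):
    """Keep words whose count lies in [min_freq, max_freq].

    The result is sorted by decreasing frequency, then alphabetically.
    """
    selected = [(w, c) for w, c in counts.items() if min_freq <= c <= max_freq]
    selected.sort(key=lambda wc: (-wc[1], wc[0]))
    return selected


# Toy corpus (hypothetical); a real run would read the training file instead.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog",
]
counts = count_words(corpus)

# Drop singletons (min_freq=2) and very frequent words (max_freq=3).
rare = select_by_frequency(counts, min_freq=2, max_freq=3)
for word, freq in rare:
    print(word, freq)
```

Sorting in increasing order and taking the first N lines, as described in the question, mostly selects words that occur once; setting an explicit lower bound avoids that.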

The substitutions are empty because, with your parameters and models, the script could not come up with appropriate substitutions. Either the hyperparameters are set too high for the models you have or, as I suspect, your language model was trained on so little data that its prediction probabilities fall below the default hyperparameter values.
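To make the thresholding effect concrete, here is a small Python sketch. It assumes a candidate word is kept only when the language model assigns it a probability at or above some threshold; the function `filter_candidates`, the threshold values, and the probabilities are invented for illustration and do not reflect the actual script or its defaults.

```python
def filter_candidates(predictions, threshold):
    """Return the words whose predicted probability is at least `threshold`."""
    return {word for word, prob in predictions.items() if prob >= threshold}


# Hypothetical next-word probabilities at one position of "I enjoy it".
# A model trained on little data tends to spread probability thinly.
weak_model = {"love": 0.004, "like": 0.003, "hate": 0.001}
strong_model = {"love": 0.21, "like": 0.15, "hate": 0.02}

# With a threshold of 0.01, the weak model yields an empty set.
print(sorted(filter_candidates(weak_model, threshold=0.01)))

# A stronger model passes the same threshold.
print(sorted(filter_candidates(strong_model, threshold=0.01)))

# Lowering the threshold also recovers candidates from the weak model.
print(sorted(filter_candidates(weak_model, threshold=0.002)))
```

Under this assumption there are two remedies: train the language model on more data, or lower the thresholds to match the probabilities your model actually produces.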