After reading your paper, which I found genuinely interesting, I took a close look at your code; I must admit it is very well organized, so thank you for your work.
Before starting my experiments with it, I would like your suggestions on how to tune the system and training parameters:
Is there a configuration of the lm and lmnmt architectures that works better than the default?
For both lm and lmnmt training, what is the best setting of the training parameters (learning rate, warmup updates, etc.)?
Is there any aspect of training I should pay particular attention to in order to avoid poor performance?
For the lmnmt arch, I adopt the default fairseq config, which I find to be a strong baseline, especially for small-scale translation tasks such as IWSLT. For the lm arch, I did not do much tuning; I simply use an architecture similar to the corresponding lmnmt arch (see the sketch below).
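For concreteness, here is a rough sketch of what I mean by the default fairseq recipe. The data path and arch name are the stock values from the fairseq translation examples, not necessarily exactly what this repo uses:

```python
import subprocess

# Sketch of the stock fairseq IWSLT translation recipe (values taken from the
# fairseq examples/translation README). The data path and arch name below are
# illustrative; substitute this repo's own data-bin and arch registration.
nmt_cmd = [
    "fairseq-train", "data-bin/iwslt14.tokenized.de-en",
    "--arch", "transformer_iwslt_de_en",
    "--share-decoder-input-output-embed",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--max-tokens", "4096",
]
subprocess.run(nmt_cmd, check=True)
```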
For lmnmt training, I adopt the same fairseq config. For lm training, I am sorry to say I have not found a good reference setup, so I reuse the same config as for lmnmt training.
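Roughly, these are the optimizer and schedule flags I reuse for both. Again, they are the stock fairseq IWSLT values, and the LM task/arch names below are standard fairseq options used only for illustration, so please treat the whole thing as a starting point rather than the tuned numbers from the paper:

```python
import subprocess

# Optimizer / schedule flags shared between the translation model and the LM
# (stock fairseq IWSLT values; not claimed to be optimal for the LM).
train_flags = [
    "--optimizer", "adam",
    "--adam-betas", "(0.9, 0.98)",
    "--clip-norm", "0.0",
    "--lr", "5e-4",
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "4000",
    "--dropout", "0.3",
    "--weight-decay", "0.0001",
]

# Hypothetical LM run reusing the same flags. "--task language_modeling" and
# "--arch transformer_lm" are standard fairseq options; the actual arch name
# and data path in this repo may differ.
lm_cmd = [
    "fairseq-train", "data-bin/lm-data",
    "--task", "language_modeling",
    "--arch", "transformer_lm",
    "--tokens-per-sample", "512",
    "--max-tokens", "2048",
] + train_flags
subprocess.run(lm_cmd, check=True)
```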