Open zhl5842 opened 2 years ago
Setting K=10 seems pointless, because each time the sample is the same.
Hi @zhl5842 ,
Thank you for your question.
In this work, we leverage Monte Carlo Dropout to obtain samples of sentence-level translation probability. You can find more details in the link.
Ignoring all the math details: when using Monte Carlo Dropout, we run K forward passes for the same input with dropout activated. That is, for each forward pass, the model makes predictions with a random subset of parameters. Hence, the model predictions are different across the K inferences, even though the input is the same, and we take the expectation over multiple inferences to obtain the predictive posterior.
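If it helps, here is a minimal sketch of the procedure (a toy PyTorch loop with a hypothetical model signature, not the code in this repository): keep the dropout layers stochastic, run K forward passes, and average the sentence-level log-probabilities.

```python
import torch

def mc_dropout_sentence_logprob(model, src_tokens, tgt_tokens, k=10):
    """Toy sketch: expected sentence-level log-probability under MC Dropout."""
    model.eval()                          # keep eval-time behaviour elsewhere
    for m in model.modules():             # ...but re-activate the dropout layers only
        if isinstance(m, torch.nn.Dropout):
            m.train()
    samples = []
    with torch.no_grad():
        for _ in range(k):                # K stochastic forward passes
            logits = model(src_tokens, tgt_tokens)        # hypothetical signature
            logprobs = torch.log_softmax(logits, dim=-1)  # [batch, tgt_len, vocab]
            tok_lp = logprobs.gather(-1, tgt_tokens.unsqueeze(-1)).squeeze(-1)
            samples.append(tok_lp.sum(dim=-1))            # sentence log-prob per pass
    return torch.stack(samples).mean(dim=0)               # expectation over the K passes
```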
Hi, but you have used model.eval() in the code, so dropout wouldn't work, right? https://github.com/huawei-noah/noah-research/blob/master/noahnmt/multiuat/fairseq/fairseq/tasks/multiuat-multilingual-translation.py#L381
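For reference, a tiny self-contained check with a plain torch.nn.Dropout layer (nothing from this repo) shows why this matters: after .eval() the dropout becomes the identity, so every pass returns the same prediction.

```python
import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(5)

drop.train()              # training mode: the mask is resampled on every call
print(drop(x), drop(x))   # two (almost surely) different outputs

drop.eval()               # eval mode: dropout is a no-op
print(drop(x), drop(x))   # identical outputs, both equal to x
```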
Hi @zhl5842 ,
Thank you for pointing out this problem.
You're right, and I believe this is a mistake in the released code. Perhaps I handed the wrong version over to my colleague when releasing it; I checked my own private code and didn't find this line there. This line of code can be safely removed.
I will re-run all the related experiments to make sure everything is correct. I think this is actually an interesting analysis to measure the effect of Monte Carlo Dropout.
I will keep you updated as soon as I get the results (within 24 hours).
I am no longer working at HUAWEI and have contacted my colleague to fix this problem.
Thank you again.
OK @minghao-wu. I have run the experiment with K=1 and find that both results (K=1 and K=10) are similar. Looking forward to your results!
In addition, I notice that, compared to multidds, multiuat initializes with --lr 5e-04 and --weight-decay 0.0001, while multidds initializes with --lr 2e-04, --attention-dropout 0.3, --relu-dropout 0.3 and --weight-decay 0.0. I reproduced the results, which are similar to those released in the paper. But when I keep these hyperparameters the same (--lr 5e-04, --weight-decay 0.0001), the results of multidds and multiuat are also similar, so does the uncertainty-aware method have no advantage?
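To make the two configurations concrete, here is the comparison I mean, written out as plain dictionaries (flag names as quoted above; this is only an illustration, not the exact launch commands):

```python
# Illustration only: the two hyperparameter sets quoted above, side by side.
multiuat_hparams = {
    "--lr": "5e-04",
    "--weight-decay": "0.0001",
}
multidds_hparams = {
    "--lr": "2e-04",
    "--attention-dropout": "0.3",
    "--relu-dropout": "0.3",
    "--weight-decay": "0.0",
}
# My comparison: re-run multidds with multiuat's --lr and --weight-decay
# (making the two sets match) and the final scores come out similar.
```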
Hi @zhl5842 ,
With limited computational resources and an overly large hyperparameter search space, I didn't do a hyperparameter search for multidds-s and directly used their recommended hyperparameters, assuming those are optimal for their own approach. In fact, I didn't successfully reproduce their results with their recommended hyperparameters and my own implementation, so I use their reported results in the main content of my paper for a fair comparison, under the same assumption. You know, it's very hard to have a 100% perfect re-implementation. Their released code has compatibility issues on our computational hardware. To be honest, I am also surprised that my hyperparameters work well for their approach.
As I mentioned in my paper, most of our observations are consistent with Wang et al. (2020), and the main focus of my work is multi-domain NMT. We find that multidds-s is vulnerable in multi-domain NMT and that our approach, multiuat, works reasonably well for both multilingual and multi-domain NMT. Given that text corpora may come from heterogeneous sources, our approach is a safer and better choice. That is, the strength of multiuat is mainly demonstrated on multi-domain NMT.
No advantage? Yes and no: multidds-s and multiuat may have similar performance in multilingual NMT (as shown in Figure 1 and your own results), but multiuat is definitely a better choice when you do not have a sufficient understanding of your datasets.
Hi @minghao-wu. First, I don't know why you changed the learning rate. I suppose that if we used the multidds hyperparameters, the results would also be the same, so this can't prove that multiuat is effective.
Second, for multi-domain NMT, I think your method may be effective, but you should keep the same hyperparameters and compare the results; otherwise you can't convince me, just like with the parameter K.
Thanks, looking forward to your new results!
Hi @zhl5842 ,
Firstly, as I mentioned before, I assumed their recommended hyperparameters are optimal for their approach, so I directly followed their recommendations. I didn't apply my hyperparameters to their approach. When the resources are limited, carefully tuning others' work is not my first priority.
Secondly, in multi-domain NMT, I use the same hyperparameters for both multidds and multiuat, which are tuned by myself, because Wang et al. (2020) didn't apply multidds to multi-domain NMT. With the identical setup in multi-domain NMT experiments, there is a significant difference between these two approaches.
I re-ran multiuat with and without Monte Carlo Dropout on the multilingual M2O diverse setup, and find that Monte Carlo Dropout has only a marginal effect on the final performance. The choice of K doesn't make a big difference either. The improvement mainly comes from the algorithm itself. A smart choice of reward function can make the training more robust.
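Just to illustrate what I mean by the reward being the important part, here is a toy example of turning the spread of the K sampled sentence log-probabilities into an uncertainty score; note this is not the exact reward function used in multiuat, only a sketch of the idea:

```python
import torch

def sentence_uncertainty(sample_logprobs):
    """sample_logprobs: [K, batch] sentence-level log-probs from K MC Dropout
    passes; a higher variance means the model is less certain."""
    return sample_logprobs.var(dim=0)

# Hypothetical numbers: K=3 passes, 2 sentences.
samples = torch.tensor([[-3.1, -7.9],
                        [-3.0, -6.2],
                        [-3.2, -9.0]])
weights = torch.softmax(sentence_uncertainty(samples), dim=0)  # favour uncertain data
```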
Hi @minghao-wu. Firstly, the parameter K is useless; that has been verified. Secondly, I think that for multilingual NMT the improved result comes only from increasing the learning rate (2e-04 to 5e-04); the main reason is that multidds is not well trained. Actually, multiuat (LR=5e-04 and uncertainty) gets the improvement compared to multidds (LR=2e-04 and gradient cosine similarity), and your paper says this comes mainly from uncertainty. I don't think so, and the results are unbelievable. I am sorry, but these are the actual experimental results.
Hi @zhl5842 ,
I don't think we have a huge disagreement on the multilingual NMT results. It's indeed a great finding that multidds can be improved with my hyperparameters, and I do encourage you to keep working on it. As I mentioned over and over again, I adopted the recommended hyperparameters from Wang et al. (2020) and didn't tune hyperparameters for multidds. Tuning hyperparameters for others is not my top priority. I compared my results with their reported results. If their reported results underperform, I'm not the one to be blamed.
Again, I have said this many times and don't mind emphasizing it again. The main focus of our work is that multiuat is robust to changes in the datasets and multidds is not. All of our experiments are designed around this argument. The core value of our work lies in our findings on multi-domain NMT. The updated multilingual NMT results for multidds do not change its vulnerability in multi-domain NMT.
Don't be sorry. You just don't care about the valuable part of my work. It's your loss, not mine.
I won't continue this unconstructive conversation.
K=10 and each e is the same...