reproducibility-challenge / iclr_2019

ICLR Reproducibility Challenge 2019
https://reproducibility-challenge.github.io/iclr_2019/

Submission for #97 #141

Open JACKHAHA363 opened 5 years ago

JACKHAHA363 commented 5 years ago

Submission for #97

JACKHAHA363 commented 5 years ago

Hi, where can I see the review?

koustuvsinha commented 5 years ago

Hi @JACKHAHA363, reviewers have just been assigned to all the projects. The reviews will be posted by our bot @reproducibility-org in the respective Pull Requests, and you will have the opportunity to revise your submission and respond to the reviewers.

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: ~5~ 6 * Reviewer 3 comment: The report aims to reproduce the policy-gradient baselines from the "Countering Language Drift via Grounding" paper. It analyzes some of the paper's claims and independently verifies (or, in some cases, refutes) them. The code has been made publicly available.

General questions:

Some questions related to implementation:

* Edit: Reviewer updated score based on author feedback

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 7 * Reviewer 1 comment:

Problem statement: The authors do a good job summarizing the problem addressed in the original paper (language drift in collaborative translation) and the approach to solving it (reducing language drift by introducing an appropriate grounding task). They identify a likely source of difficulty in reproducing the work (the policy gradient method used to train the collaborative translation) and focus their efforts on showing that this component of the work is reproducible.

In light of this, the authors restrict their analysis to the policy gradient baseline used in the original paper. This seems like a sensible scope for a reproduction study, as it focuses on one aspect of the problem that is frequently difficult to reproduce. 

However, this choice also limits the potential usefulness of the reproduction: the main result of the original paper is not that policy gradients produce the phenomenon of language drift (which was anticipated based on results elsewhere in the literature on language learning using self-play), but that the proposed grounding method reliably solves this. Results of this nature would most likely strengthen the impact of the reproduction. 

That being said, the authors very clearly specify the scope of the reproduction they are attempting, so I don’t feel this unduly limits the usefulness of this reproduction.

Code: The code included with this submission is reasonably well-organized and readable. The code appears to reimplement the project from scratch.

Communication with original authors: I did not see any evidence of communication with the original authors.

Hyperparameter search: The authors performed a grid search over appropriate policy gradient hyperparameters (the entropy weight and the learning rate). This resulted in improved performance on the policy gradient baseline over the original submission.

Ablation study: As far as I could tell, no ablations were performed.

Discussion of results: The results are presented thoughtfully, with tables and plots that allow direct and easy comparison between the original and reproduced results. The results appear to be consistent with the original paper. Although somewhat better policy gradient baseline results were obtained, these did not surpass the performance of the improved method presented in Lee et al. 2019. This is consistent with the original paper's claim (although, as noted above, the claim of reproducibility is weakened because the improved method itself is not reproduced in this work).

Recommendations for reproducibility: The authors point out several implementation details that they found important for getting good results with the model, or that affected its results, but which were not discussed in the original paper (e.g. length normalization, sharing GRU weights). These details may be very helpful for researchers building on this work in the future.
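To illustrate the kind of detail meant by length normalization, a minimal sketch is given below. The names are hypothetical and this is not the authors' actual code; it simply shows a sequence-level reward being divided by the length of the generated sequence before it enters the policy gradient.

```python
# Minimal sketch (hypothetical names, not the authors' actual code) of the
# length normalization mentioned above: the sequence-level reward is divided
# by the number of generated tokens before being used in the policy gradient.
def length_normalized_reward(sequence_reward: float, generated_tokens: list) -> float:
    """Scale a sentence-level reward by the length of the generated sequence."""
    return sequence_reward / max(len(generated_tokens), 1)
```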

Overall organization and clarity: The report, results, and code are well-organized and interpretable. I believe this reproduction will be of use to other researchers attempting to build on and understand the work of Lee et al. 2019.

Confidence: 3

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 7 * Reviewer 2 comment: The authors successfully implemented the approach proposed in the original paper for countering language drift when using policy gradient methods. With their implementation, they achieved comparable results. Furthermore, the authors found that length normalisation in the reward computation is essential to the results, which is not explicitly mentioned in the original paper. Overall, this is a good report that not only reproduces the main results but also gives some additional insights.

Confidence: 4

JACKHAHA363 commented 5 years ago

Response to Reviewer 3

The report authors mentioned that they discussed the sequence length on OpenReview with the paper authors. I must have missed it while reading the discussion on OpenReview. Could the authors please provide a link to that discussion?

I rechecked on OpenReview and realized that the original comment was set to private. It is public now, and the reviewer should be able to see the exchange.

Why use vanilla policy gradients only?

Using a more advanced policy gradient method such as PPO or TRPO is definitely worth trying, but we think it is beyond the scope of this report because we want to stay close to the original paper. In addition, this policy gradient method (REINFORCE with a learnt value baseline) is widely employed in current self-play/RL work in the NLP community [1, 2], so we think confirming the language drift of this method should be representative. That being said, we are aware of this, and we have also implemented PPO here: https://github.com/JACKHAHA363/language_drift/blob/master/ld_research/training/finetune_ppo.py
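For concreteness, the objective we mean by "REINFORCE with a learnt value baseline" (plus the entropy bonus weighted by \alpha_ent) has roughly the following shape. This is a simplified sketch with placeholder names, not the exact code in our repository.

```python
# Simplified sketch (placeholder names, not our repository's exact code) of
# REINFORCE with a learnt value baseline and an entropy bonus.
import torch
import torch.nn.functional as F

def policy_gradient_loss(log_probs, entropies, values, rewards, alpha_ent=0.01):
    """log_probs, entropies, values: [batch, seq_len]; rewards: [batch]."""
    returns = rewards.unsqueeze(1).expand_as(values)   # sequence-level reward per token
    advantages = (returns - values).detach()           # subtract the learnt baseline
    pg_term = -(advantages * log_probs).mean()         # REINFORCE objective
    value_term = F.mse_loss(values, returns)           # regress baseline toward returns
    entropy_term = entropies.mean()                    # exploration bonus
    return pg_term + value_term - alpha_ent * entropy_term
```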

The hyperparameters tweaked are the learning rate of Agent A and \alpha_ent? Or were there more hyperparameters that were optimised?

Yes. We also tried a linearly decaying learning rate, but it converged to a suboptimal solution. The hyperparameters of Agent B did not make much difference, though we may not have performed a thorough enough search there. Our focus on the learning rate and \alpha_ent is motivated by the facts that 1) policy gradient is known to be very sensitive to the learning rate, which is part of the motivation for methods like TRPO, and 2) the effect of \alpha_ent on the fine-tuning results is discussed by one of the reviewers and the authors on OpenReview. We have updated the paper to include more details on the hyperparameter optimization.
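As an illustration, the search has roughly the following shape; the grid values and the `run_finetune` wrapper below are placeholders, not our exact configuration.

```python
# Illustrative sketch of the grid search over the learning rate of Agent A and
# the entropy weight \alpha_ent. Grid values and run_finetune are placeholders.
import itertools

def run_finetune(lr: float, alpha_ent: float) -> float:
    """Hypothetical wrapper: fine-tune with these settings and return dev BLEU."""
    raise NotImplementedError("plug in the actual fine-tuning entry point here")

learning_rates = [1e-3, 1e-4, 1e-5]   # placeholder grid
entropy_weights = [0.0, 0.01, 0.1]    # placeholder grid

scores = {
    (lr, a): run_finetune(lr, a)
    for lr, a in itertools.product(learning_rates, entropy_weights)
}
best_lr, best_alpha_ent = max(scores, key=scores.get)
```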

The pretraining results have not been reproduced to a large extent - do you have thoughts about why that might be the case?

We think it could be caused by the learning rate schedule. The original paper does not articulate these details, even though the authors appear to use a schedule, and we used the default one from OpenNMT. We think it could be worthwhile to reproduce the pretraining results exactly, but the main focus of this report is to confirm the language drift that occurs from pre-training to policy-gradient fine-tuning on a new corpus.

Response to Reviewer 1

However, this choice also limits the potential usefulness of the reproduction: the main result of the original paper is not that policy gradients produce the phenomenon of language drift (which was anticipated based on results elsewhere in the literature on language learning using self-play), but that the proposed grounding method reliably solves this.

Yes, we agree that we did not reproduce the main claim of the authors, but we have some ongoing results on fine-tuning with a language model, implemented here: https://github.com/JACKHAHA363/language_drift/blob/master/ld_research/training/finetune.py#L531. We chose to restrict our scope so that we could have a more thorough discussion, which you also kindly noted in your review.

References:

[1] Gao, Jianfeng, Michel Galley, and Lihong Li. "Neural approaches to conversational AI." The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018.

[2] Bahdanau, Dzmitry, et al. "An actor-critic algorithm for sequence prediction." arXiv preprint arXiv:1607.07086 (2016).

JACKHAHA363 commented 5 years ago

To all reviewers

Thank you for your time! We have just updated our paper (7d522be) to include a section highlighting the potential pitfalls and challenges encountered during our reproduction attempt. @reproducibility-org

JACKHAHA363 commented 5 years ago

@koustuvsinha Will the bot update my response?

JACKHAHA363 commented 5 years ago

@reproducibility-org Any updates?

JACKHAHA363 commented 5 years ago

Updates?