
ICLR Reproducibility Challenge 2019
https://reproducibility-challenge.github.io/iclr_2019/

ICLR Reproducibility 2019: AutoLoss #147

Open timur26 opened 5 years ago

timur26 commented 5 years ago

Submission of AutoLoss reproducibility report for ICLR 2019 challenge. Issue number: #89

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 3
Reviewer 3 comment:

PROBLEM STATEMENT There is a brief introduction to the problem addressed in the original paper. It is vague and makes it difficult for the reader to understand the main contributions of the ICLR submission.

CODE The code is developed from scratch.

COMMUNICATION WITH THE ORIGINAL AUTHORS Communication with the authors was established, but the details provided were not enough to reproduce the results in the ICLR submission.

HYPERPARAMETER SEARCH The report indicates that the ICLR submission is highly sensitive to hyperparameters. The authors of the report explored some hyperparameters, but it is not clear how exhaustive that search was.

ABLATION STUDIES No ablation studies were reported. They probably do not apply to the type of work being reproduced anyway.

DISCUSSION ON RESULTS There are some mentions of how difficult it is to reproduce the results due to missing details and instability with respect to hyperparameters. However, the discussion is fuzzy and diluted among several other observations.

RECOMMENDATIONS FOR REPRODUCIBILITY No recommendations are made to the ICLR submission authors.

OVERALL ORGANIZATION AND CLARITY My main criticism of this report is that, in general, I had a really hard time trying to understand the manuscript. While I credit that the authors have made an effort to reproduce the ICLR submission, the submitted text makes it very difficult to actually assess their work. In particular:

W1 The organization of the paper is quite confusing, as it goes back and forth on a set of 3 tasks. It would have been much clearer to compare AutoLoss vs. Baseline vs. Hand-crafted Schedules for each task separately.

W2 It is not clear whether the baseline and hand-crafted schedules were in the original ICLR submission or were introduced by the report authors.

W3 The references to Figures 3, 4 and 5 are confusing, as each of them shows two plots.

W4 When commenting on Figure 4, the text talks about the 'training loss (blue)', but the blue line is labelled as 'classification loss'. This is confusing.

W5 Graphs have more colors in the legend than in the actual graph.

W6 The lambda hyperparameter is not defined. It may be defined in the ICLR submission; if that is the case, it should be briefly defined again in this report. The text should be self-contained.

W7 In Section 3.1, there is a claim that the model overfits, but I do not really see where or how. I would expect the validation loss to grow while the training loss decreases, which I do not see in either of the two plots in the referred Figure 2.

W8 Section 3.1 refers to 'as we can see' (where?), and Section 3.2 refers to 'above' instead of referring to the exact section. All these vague references make it very hard for a reader to follow the report and need to be treated carefully.

Confidence: 4

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 3
Reviewer 2 comment:

For this document, we will refer to the authors of the report as the authors and the writers of the original ICLR submission as the writers; similarly, we refer to the reproducibility report as the report and the original ICLR submission as the paper.

Problem Statement: The problem statement provided in the report is clear.

Code: Both the authors of the report and the writers of the paper have released their code. The report does not mention whether the authors used the code provided by the writers. We would like to know the degree to which the code developed by the authors is based on that of the writers.

The code provided by the authors is well structured and is accompanied with detailed instructions required for its execution. However, we did not execute the code.

Communication with original authors:

The authors communicated with the writers. However, the authors claim that the writers did not provide sufficient support for a complete review. Specifically, the authors claim that they were unable to obtain the hyper-parameters used by the writers.

Hyperparameter search:

The authors perform a grid search over various settings of the regularizer lambda, as done by the writers. However, the authors do not experiment with hyper-parameters beyond that; in particular, they do not explore other hyper-parameters of the controller, although in my opinion such searches are not a must.
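For concreteness, the sketch below illustrates the kind of grid search over the regularizer lambda described above. It is a minimal, hypothetical illustration only: `train_autoloss` and `validate` are placeholder stand-ins rather than functions from the report's or the paper's code, and the lambda grid is assumed, not the one actually searched.

```python
# Hypothetical sketch only: train_autoloss() and validate() are dummy
# stand-ins, not functions from the report's or the paper's codebase,
# and the lambda grid below is assumed rather than taken from either.

def train_autoloss(reg_lambda):
    # Stand-in for training the AutoLoss controller with a given
    # regularizer weight; returns a mock "model" carrying the setting.
    return {"reg_lambda": reg_lambda}

def validate(model):
    # Stand-in for held-out validation accuracy; peaks at 0.1 purely so
    # the example has a well-defined best value.
    return 1.0 - abs(model["reg_lambda"] - 0.1)

def grid_search(lambda_grid):
    # Train and validate once per lambda setting, then keep the best one.
    results = {lam: validate(train_autoloss(lam)) for lam in lambda_grid}
    best_lambda = max(results, key=results.get)
    return best_lambda, results

if __name__ == "__main__":
    best_lambda, scores = grid_search([0.001, 0.01, 0.1, 1.0])
    print("best lambda:", best_lambda)
    print("scores:", scores)
```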

Ablation Study: The only ablation study present in the paper (Table 2 in section A.2 of the appendix) is not replicated in the report.

Discussion on results: Could the authors provide further clarity on points 1 and 2 in the discussion section? I could not understand them. Specifically, my doubts are the following: in point 1 of Section 6, "the accuracy of the output..." - for which task or dataset is the accuracy reported? In point 2, could the authors point to the section in the report which supports the statement: "drastically increasing or decreasing lambda will lead to sub-optimal solutions"?

It would help me evaluate the report better.

Recommendations for reproducibility: Figures 3, 4 and 5 are unclear. For example, in Figure 3 the authors report "classification loss" and "validation loss"; it is unclear what the "classification loss" refers to.

It is also unclear whether Figures 4 and 5 are separately plotted figures comparing AutoLoss schedules with the hand-crafted and joint minimization schedules on the same task.

There is a lack of clarity due to seemingly missing lines in the plots (lines are plotted on top of each other). Please mention in the plot captions that the lines are plotted on top of each other.

Overall organization and clarity: The report partially reproduces experiments detailed in the paper, and gives a list of observations regarding the proposed AutoLoss scheduler. However, the figures in the report need clarity (see Recommendations for reproducibility) and the claims need additional support (see Discussion on results). In its current form, the report was very hard to understand and I vote to reject it.

I am willing to upgrade my score if the report addresses my concerns.

Confidence : 4

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4
Reviewer 1 comment:

The problem statement is a bit too concise. Although the original paper contains all the details, the report could have made the problem statement a bit more comprehensive. It seems that communication with the authors took place, although little evidence of it is provided in the report. Regarding hyperparameter search, it seems that the authors restricted themselves to the grid search provided by the original paper. A more thorough hyperparameter search would have resulted in a better presentation.

Some points regarding discussion:

Overall organization needs a bit more work. Instead of going back and forth, I recommend trying to frame the main claims given in the paper and then presenting the reproducibility study alongside them.

Confidence: 4