
ICLR Reproducibility Challenge 2019
https://reproducibility-challenge.github.io/iclr_2019/

Submission for issue #65 #144

Open zsq007 opened 5 years ago

zsq007 commented 5 years ago

65

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 2
Reviewer 1 comment:

Problem statement and report: The authors include a long discussion of the problem setting and the method proposed in the original paper. Unfortunately, the report largely consists of plagiarism of the original submission. This severely limits the suitability of the report for publication.

The description of what is performed in the reproduction is buried in this text. This makes it difficult for the reader to identify what is original to the reproducibility report and what can already be found in the original paper. For example, in the introduction of the report, only the three bullet points are original; the rest is paraphrased from the original paper. 

The only substantial discussion of reproduction results is in section 4. Very little motivation is given for the choices of what to reproduce from the original paper (why only these hyperparameters? why only MNIST? why are the results from only a single experiment in the original paper discussed?).

Code: This submission includes no high-level documentation and is not organized as a repository on GitHub (all code is in a zip file). The code itself is disorganized and includes almost no comments, and as such is inaccessible to the intended audience (researchers interested in an interpretable, reliable reproduction).

Communication with original authors: This submission reports no attempt to communicate with the authors of the original paper.

Hyperparameter search: Some effects of hyperparameters are mentioned in section 4. These effects appear to be anecdotal, as no quantitative effects of using different hyperparameters are shown. In general, the choice of hyperparameters is not clearly discussed, and no attempts to address differences between the original and reproduced submissions by modifying hyperparameters are mentioned.

Ablation study: No ablations are mentioned.

Discussion of results: Based on the authors' descriptions of their reproduction, it might seem like they have successfully reproduced the major results from the original paper. However, the results as presented do not justify such a conclusion. The reproduction only attempts to reproduce results on MNIST and not the two other (and perhaps more interesting) datasets. On MNIST, results are only presented for a single hyperparameter setting, and even here it is not clear that the results are reproduced. For example, the reproduction reports a false positive rate of ~20 after 100 epochs, while the original paper reported a false positive rate of <5 in the same setting. The authors report differences in training dynamics and speculate on the sources of this difference, but no evidence for the conclusions is given.
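For concreteness, I am reading "false positive rate" here as the usual FP / (FP + TN), expressed in percent; that reading is my assumption, since neither the report nor the original paper defines the metric in the reproduction. A minimal sketch, assuming binary {0, 1} labels:

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR in percent: fraction of true negatives predicted positive.
    Assumes binary {0, 1} labels (my assumption; the metric is not
    defined in the reproduction report)."""
    negatives = (y_true == 0)
    false_positives = np.sum(negatives & (y_pred == 1))
    return 100.0 * false_positives / max(int(np.sum(negatives)), 1)
```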

For future reproductions, I would urge the authors to attempt to carefully reproduce the baseline results as well. This would allow for much better apples-to-apples comparisons to the plots and tables in the original paper, and might help in diagnosing problems with the reimplementation.

Unfortunately, the authors generalize too broadly from the reproduced results. For example, the following statement is presented with no supporting evidence: “In sum, the author’s algorithm can perform better than the traditional algorithms and can converge eventually when using the MNIST data set.”

Recommendations for reproducibility: The authors of the reproduction offer no recommendations to the original authors.

Overall organization and clarity: In addition to the extensive paraphrasing and plagiarism of the original paper mentioned above, there are many issues with grammar and phrasing throughout the report. This makes it hard to understand exactly what design decisions are being made. For example, take the following sentence fragment from section 3.1.2: “Biased N data uniformly to the latent categories: [1,3,5,7,9] and the probability is [0.03, 0.15, 0.3, 0.02, 0.5].” It is difficult to understand what “N” and “the probability” refer to here.
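One plausible reading (my guess, not confirmed by the report) is that the biased negatives are MNIST examples drawn from the odd-digit classes [1, 3, 5, 7, 9], with each example's class sampled from the categorical distribution [0.03, 0.15, 0.3, 0.02, 0.5]. A minimal sketch of that interpretation, with a hypothetical helper name not taken from the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed interpretation of section 3.1.2: "N" = the biased negative set,
# "the probability" = per-class sampling probabilities over the odd digits.
LATENT_CATEGORIES = np.array([1, 3, 5, 7, 9])
CLASS_PROBS = np.array([0.03, 0.15, 0.3, 0.02, 0.5])  # sums to 1.0

def sample_biased_negatives(images, labels, n):
    """Draw n biased-negative MNIST examples (hypothetical helper)."""
    classes = rng.choice(LATENT_CATEGORIES, size=n, p=CLASS_PROBS)
    idx = np.array([rng.choice(np.flatnonzero(labels == c)) for c in classes])
    return images[idx], labels[idx]
```

If that is what section 3.1.2 means, stating it this explicitly in the report would remove the ambiguity.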

Figure labels and captions would benefit from more detail. From Fig. 2 alone, it is not obvious that the results are for the PUbN\N method on the MNIST dataset, or that this figure reproduces Fig. 2a of the original manuscript. These facts can be inferred from the text, but including this information in the figure would make the reader's job much easier.

Confidence: 4

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 3
Reviewer 3 comment:

Unfortunately, I couldn't locate any explanation or report. All I see is the zipped file, which contains code with few comments and little explanation, which makes it extremely hard to decipher what is going on.

It is hard to know which experiments from the original paper are reproduced and what the results look like. A self-contained notebook with a limited set of experiments might be more helpful to readers.

There is also no mention of communication with the authors to clarify any questions, nor of any hyperparameter investigation.

Confidence: 3

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 3
Reviewer 2 comment:

The authors attempt to replicate the paper "Classification from Positive, Unlabeled and Biased Negative Data". They present a report that contains a large amount of plagiarised material from the original paper. I would advise the authors to rephrase their description to give insight into their own understanding of the problem.

The codebase is also unstructured and hence does not help improve reproducibility. Given the challenge the authors are participating in, I feel it is extremely important to arrange their code as a properly structured GitHub repository.

Although the authors are able to replicate some of the experiments, they attribute the differences to hyperparameter choice. It would be helpful to see how these differences grow or shrink with different hyperparameters.

Overall, I feel that the authors need to arrange their codebase better to suit the context of the challenge. It would also help if they could add recommendations for improving reproducibility, given their experience of implementing the paper.

Confidence: 4