
ICLR Reproducibility Challenge 2019
https://reproducibility-challenge.github.io/iclr_2019/

Submission for issue #85 #145

ptrcarta opened 5 years ago

ptrcarta commented 5 years ago

85

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 7. Reviewer 1 comment: The authors of the report summarize the content of the paper in a very clear, concise, and accessible way in the introductory chapter.

In general, they certify the original paper's results as "viable" and "robust". They communicate that they were able to reproduce the key results of the paper, although they found a few discrepancies with the implementation details provided in the original paper. These don't seem to radically affect any conclusion.

Through no fault of their own, but because of difficulties encountered in running the original code and the unavailability of adequate computing resources, the authors admit that the report falls short of providing an exhaustive investigation of the validity of the results in the original paper and cannot guarantee that no errors are present in the original results. Nonetheless, I find this report to be extremely useful in testing some of the main, basic findings of the original work and in highlighting the specific practical issues encountered in this effort. Continuing to document these experiences is extremely valuable for the community to better understand and overcome the obstacles that prevent machine learning from enjoying the status of a reproducible, scientific field.

The authors report that the code from the original submission was shared publicly, but, because of the lack of version and dependency specifications, they were unable to run the original code without errors and therefore resorted to reimplementing the code themselves. This useful, actionable feedback certainly points to a shortcoming in the original paper's reproducibility status that could easily be addressed by the original authors.
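
One low-cost way to address this kind of shortcoming (a minimal sketch, not taken from the report; the package list below is an assumption) is to record the exact versions of the key libraries alongside the code:

```python
# Hypothetical helper: dump the exact versions of the libraries a run depends on,
# so that a requirements file with pinned versions can be shipped with the code.
from importlib.metadata import PackageNotFoundError, version

ASSUMED_DEPENDENCIES = ["tensorflow", "numpy", "matplotlib"]  # placeholder list

for pkg in ASSUMED_DEPENDENCIES:
    try:
        print(f"{pkg}=={version(pkg)}")   # these lines can be pasted into requirements.txt
    except PackageNotFoundError:
        print(f"# {pkg} is not installed in this environment")
```

Pinning versions this way (or simply committing the output of `pip freeze`) might have allowed the report's authors to run the original PyTorch code unchanged.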

The new code is implemented in TensorFlow, which complements the original PyTorch implementation.

The authors make useful observations about their practical experience reproducing the original work. For example, they inform us that certain experimental choices, such as the size of the perturbation Delta x in the "Gaussian Balls" experiment, were not documented by the original authors, so they contribute to the reproducibility of the paper by sharing their observation and insight on what reasonable choices for its magnitude might be.

I find the willingness of the authors to share the difficulties they encountered in this reproducibility effort, and their decision-making process around the technology to use (in the "Machine Learning Stack" section), extremely useful in painting a clear and powerful picture of many researchers' daily struggles with reproducing promising results under limited resources. They explicitly state that the absence of key ingredients for reproducibility, which forced them to reimplement the code from scratch (although, as they also point out, this does carry some added benefits), prevented them from carrying out all the tasks they intended to run and put a damper on their efforts.

Among other useful suggestions, they point to containerization as a possible solution to combat the issues that arise from differences in environments, package versions, etc.
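
For concreteness, a containerized setup along the lines the authors suggest could look like the following sketch (the base image, pinned versions, and entry point are assumptions, not details from either codebase):

```dockerfile
# Hypothetical Dockerfile sketch: pin the interpreter and library versions
# so the experiments run in the same environment on any machine.
FROM python:3.6-slim

WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # requirements.txt pins exact versions

COPY . .
CMD ["python", "train.py"]                            # placeholder entry point
```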

Perhaps some of the reproducibility and versioning issues could have been solved by engaging with the original authors in a conversation on OpenReview. It is unclear whether this happened in a non-public setting (one of the official reviewers wrote: "I have started a thread which is only visible by the authors, in order to keep the page limited to discussion about the content of the paper rather than technical help"). Prior to that comment, the authors had explicitly posted more information about their setup and library versions in this comment: https://openreview.net/forum?id=SJekyhCctQ&noteId=Hyx_7PKyT7.

In more detail, the authors confirm the observation that adding the fingerprint portion of the loss causes the decision boundary to become highly non-linear, as discussed in the original paper. In discussing their results, the authors of the report provide sufficient detail about the experimental setup they use. Although I thank the authors for their transparency in sharing the hyperparameters and experimental choices they made, it would be useful to accompany them with a description of whether these choices were made to match the original implementation or whether they represent the authors' desire to test the original method under different conditions.
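
For readers unfamiliar with the method, a fingerprint-style loss term has roughly the following shape (a minimal sketch only, assuming fingerprints are pairs (Delta x_i, Delta y_i) and that changes in the model's outputs are compared to Delta y_i with a squared error; the normalization and the exact outputs used may differ from both the original paper and the report):

```python
import tensorflow as tf

def fingerprint_loss(model, x, delta_xs, delta_ys):
    """Sketch of a fingerprint loss: penalize the gap between the change in the
    model's softmax output under each perturbation delta_x and its target delta_y.
    The use of softmax outputs and an unweighted sum are assumptions of this sketch."""
    y_ref = tf.nn.softmax(model(x))                      # outputs on clean inputs
    total = 0.0
    for dx, dy in zip(delta_xs, delta_ys):
        y_pert = tf.nn.softmax(model(x + dx))            # outputs on perturbed inputs
        total += tf.reduce_mean(
            tf.reduce_sum(tf.square((y_pert - y_ref) - dy), axis=-1)
        )
    return total

# Typical usage: add it to the task loss, e.g.
#   loss = cross_entropy(labels, model(x)) + lam * fingerprint_loss(model, x, dxs, dys)
```

Whatever its exact form in the two implementations, it is this extra term that, according to both the original paper and the report, makes the decision boundary highly non-linear.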

I am unsure how the "BIM" attack in the report maps to the BIM-a and BIM-b methods described in the original paper.

For some of the results and plots, the authors invite the reader to check out supplementary material in a Jupyter notebook.

This contribution reports that the original authors had found reasonable choices for the alpha and beta parameter values in Delta y to be 0.25 and 0.75, and that the proposed method is robust to randomization of the sign of Delta y. The validity of these choices and statements is challenged in this reproducibility work by showing evidence of delayed convergence in one of their experiments using the original choice of parameters. They conclude that a different choice for Delta y could be superior, but no experimental evidence for that is shown in the report.
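
Purely for illustration, one hypothetical way to build such a Delta y and to randomize its sign is sketched below (assigning +alpha to the true class and spreading -beta over the remaining classes is an assumption of this sketch, not a statement about the construction used in the paper or the report; only the alpha/beta values and the sign randomization come from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_delta_y(true_class, num_classes, alpha=0.25, beta=0.75, random_sign=True):
    """Hypothetical construction of a fingerprint target delta_y:
    +alpha on the true class, -beta spread over the remaining classes,
    optionally with a random overall sign flip (the robustness probed in the report)."""
    dy = np.full(num_classes, -beta / (num_classes - 1))
    dy[true_class] = alpha
    if random_sign:
        dy *= rng.choice([-1.0, 1.0])  # randomize the sign of delta_y
    return dy

print(make_delta_y(true_class=2, num_classes=10))
```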

In checking other hyperparameters, such as the choice of magnitude of the epsilon parameter in the "Gaussian Balls" experiment, the authors make clear, insightful suggestions on how to pick a sensible value, and why smaller values might make the optimization process harder.
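
To make the discussion concrete, here is a small sketch of a "Gaussian Balls"-style toy setup with input perturbations of magnitude epsilon (the dimensionality, class means, and the way epsilon is applied are assumptions of this sketch, not details taken from the paper or the report):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 2-D "Gaussian balls", one per class (means and spread are placeholders).
n_per_class = 200
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])
x = np.concatenate([rng.normal(mu, 0.5, size=(n_per_class, 2)) for mu in class_means])
y = np.repeat([0, 1], n_per_class)

# Fingerprint perturbations: random directions scaled to magnitude epsilon.
# epsilon controls the perturbation size; the report discusses why very small
# values can make the optimization harder.
epsilon = 0.1
num_fingerprints = 5
directions = rng.normal(size=(num_fingerprints, 2))
delta_x = epsilon * directions / np.linalg.norm(directions, axis=1, keepdims=True)

print(delta_x)
```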

The unavailability of compute justifies the attempt to replicate only a few selected results from the paper, as well as the absence of ablation studies or extensive hyperparameter sweeps.

Algorithm 1 abruptly appears in the paper without reference in the text.

The axes in Figure 1 are unlabeled and the values associated with the ticks are very hard to read. Figure 3 is not referenced in the text.

Please be consistent with your capitalization and italicization of the term "Neural Fingerprinting".

The report contains a few typos that can be easily fixed, for example in the Abstract.

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 6. Reviewer 3 comment:
* The report is largely well written, with only minor issues such as typos; the Figure 1 caption is not readable.

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4. Reviewer 2 comment: [Problem statement] The introduction to the problem is not clear. Although the report presents the definition of an "adversarial example", we do not learn what the original task is. Because the report mentions many "classes", I assumed that the original network's task is classification, but this is never stated in the report. A more "gentle" introduction to the topic could improve this report.

[Code] The authors of this report tried to run the original code but ran into software version errors. They therefore reimplemented it with the current version of TensorFlow. Note, however, that the authors of the report do not explicitly mention which version of TensorFlow they used to reproduce the paper, thereby repeating the same mistake as the original authors.

[Communication with original authors] No communication is mentioned.

[Hyperparameter Search] The authors tried to reproduce the experiments of the original paper, but no extra hyperparameter search was performed. However, this work does mention extra experiments with different numbers of fingerprints, as described in Section 4.

[Discussion on results] Not all the experiments from the original paper could be reproduced due to limited access to computation; this is not a problem in this report. The discussion of each experiment is mostly descriptive, and there is not much analysis of why certain behavior is observed. The figures are not explained in the report. Adding analysis of the results shown in each figure would improve this work considerably. For instance, in the last paragraph of Section 3.2, the authors say "Next, we <...do something...>. The result is described on Figure 6 <...>." and move on to the next section; this shows no analysis of the results. This work reports some of the difficulties encountered in reproducing the results of the original paper and proposes a reimplementation of the paper. This can be good feedback for the original authors.

[Overall organization and clarity] The overall organization of the report is fine; however, some sections should be much more detailed. For instance, Section 5 is two sentences long; there is no need to create a "section" if it holds so little information. Some typos and other visual issues should also be fixed.

This report can be improved in many ways, both in terms of visuals and in terms of content. Confidence: 4