
ICLR Reproducibility Challenge 2019
https://reproducibility-challenge.github.io/iclr_2019/

Submit report for S1erHoR5t7 (#10) #135

Open LoSealL opened 5 years ago

LoSealL commented 5 years ago

The reproducibility report for RGAN (#10)

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4

Reviewer 1 comment: This report tests some generalization properties, rather than the sole reproducibility of the exact results provided in the original paper, by applying the model to CIFAR-10 (as in the original work) and to a different dataset not used in the original publication. It is unclear whether this was done because the author could not get access to the CAT dataset or for some other, unstated reason. The report and the original paper also diverge, with minimal overlap, in the types of network architectures they implement. Furthermore, there may be slight inconsistencies in the hyperparameters selected for training. Therefore, while the report offers further insight into the strengths and possible failure modes of the proposed solution, it hardly provides a general validation (or invalidation) of the results reported in the original paper. Any discrepancy observed between the results in the original paper and those in this report can be attributed to several confounding factors introduced by the new setup. In general, the report neither makes a strong statement nor sheds much light on the reproducibility status of the original paper and code. There is some degree of qualitative agreement in certain experiments, along with contrasting observations about the relative success of the R- and Ra-GAN formulations.
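For readers unfamiliar with the two formulations being compared, here is a minimal sketch of the relativistic (R-) and relativistic average (Ra-) losses as described in the original paper, written in TensorFlow to match the report's re-implementation. The function and variable names (`rsgan_losses`, `c_real`, `c_fake`) are placeholders of mine, not identifiers from the report or the original code.

```python
import tensorflow as tf

bce = tf.nn.sigmoid_cross_entropy_with_logits  # logistic loss on raw critic logits

def rsgan_losses(c_real, c_fake):
    """Relativistic standard GAN (R-): each real logit is compared to a fake logit."""
    d_loss = tf.reduce_mean(bce(labels=tf.ones_like(c_real), logits=c_real - c_fake))
    g_loss = tf.reduce_mean(bce(labels=tf.ones_like(c_fake), logits=c_fake - c_real))
    return d_loss, g_loss

def rasgan_losses(c_real, c_fake):
    """Relativistic average GAN (Ra-): each logit is compared to the batch mean of the other type."""
    d_loss = (tf.reduce_mean(bce(labels=tf.ones_like(c_real),
                                 logits=c_real - tf.reduce_mean(c_fake)))
              + tf.reduce_mean(bce(labels=tf.zeros_like(c_fake),
                                   logits=c_fake - tf.reduce_mean(c_real))))
    g_loss = (tf.reduce_mean(bce(labels=tf.ones_like(c_fake),
                                 logits=c_fake - tf.reduce_mean(c_real)))
              + tf.reduce_mean(bce(labels=tf.zeros_like(c_real),
                                   logits=c_real - tf.reduce_mean(c_fake))))
    return d_loss, g_loss
```

Here `c_real` and `c_fake` are the critic's raw outputs on a batch of real and generated images, respectively.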

In detail: the author describes some of the image pre-processing and data-loading steps they follow, though without enough detail (random seeds, etc.) for them to be fully reproducible. For context, the original paper clearly stated the seed value it used. Furthermore, the author of the reproducibility report does not comment on whether these pre-processing procedures match the ones used in the original publication, or whether they might introduce systematic discrepancies.

The report clearly states the layer composition of the architectures tested in this study (see Table 1), though it is left to the reader to work out whether these match any of the architectures in the original paper. As far as I can tell, what the author of the report calls DCGAN 32x32 is equivalent to what the original paper calls the Standard CNN. The DCGAN 64x64, DCGAN 128x128, and DCGAN 256x256 architectures used in the original paper are not re-implemented, which is understandable and acceptable given limited computing resources. On the other hand, the author builds and tests a ResNet architecture instead, which helps further explore the generality of the method proposed in the original work.

On top of the metric used to report results in the original paper, the author of the report also provides results under a different metric, giving the reader more information to digest. The original paper, however, states that its choice of one metric over the other was dictated by the former correlating more strongly with image quality than the latter. This issue is not addressed in the reproducibility report; therefore, any discrepancy in the interpretation of the results under the different metrics might be due to the second metric being inappropriate for the task, and may not actually provide further useful insight into the performance of the method proposed in the original paper.
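For reference, the "stronger correlation with image quality" argument is usually made for FID over other sample-quality scores; whether FID is in fact the metric in question here is an assumption on my part, not something stated in this review. A minimal sketch of the Fréchet Inception Distance computed from pre-extracted feature statistics follows (the function and variable names are illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to features: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1*S2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# mu/sigma are the mean and covariance of Inception-pool features
# computed over the real and generated image sets, respectively.
```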

Although the author of the reproducibility report enumerates the various hyperparameters used for training (rate of discriminator to generator updates, optimizer parameters, etc.), the report never states whether these match what was done in the original paper, making it hard to cross-check what stays constant and what does not between the two experimental setups. Since the meaning and interpretation of the results depends on whether the hyperparameters tested in this work are identical to the ones in the original work, it would be key for the author to state the differences and similarities more clearly.

Overall, this report highlights the fact that the choice of normalization can significantly affect the convergence properties (and therefore the results) of the training runs presented in the original paper. It also points out that it is unclear whether the R-formulation can be expected to succeed at training GANs with any architecture choice, as shown in the experiments that compare DCGAN to ResNet. One should note, though, that the contradictory results may arise for a variety of reasons involving the coupling between the dataset and architecture choices, their mutual suitability, and more, and may not be indicative of the effectiveness of the method proposed in the original paper. A more in-depth discussion of the suspected causes of variation would be useful to the reader.

Problem statement:

The report shows sufficient understanding of the problem addressed by the solution proposed in the original paper and summarizes it concisely, if not masterfully, in the introduction paragraph.

Code:

The author reimplements the original code in TensorFlow and shares it in a new repository, with instructions on how to run it. This new implementation, however, lives within a larger repo of reimplemented models for video and image super-resolution and depends heavily on classes and functions from that larger framework, introducing the meta-problem of having to validate and verify another, not necessarily straightforward, codebase. This makes it hard to debug, and I cannot attest that this implementation is bug-free. On the positive side, the code is clean, neat, and easy to follow and read. Having a TensorFlow implementation on top of the original PyTorch implementation can be beneficial for the community. The author does not comment on the difficulty of reimplementing the code, nor on whether they spotted any issues with the original code during this process. The report also does not state whether the code was reimplemented by looking at the original code or directly from the details in the paper.

Communication with the original authors:

The author communicated their findings to the original author on OpenReview, but no in-depth analysis of the agreements and disagreements ensued. The author did not use OpenReview to try to clarify any details that are not documented in the original publication but may be crucial for reproducibility.

Hyperparameter Search:

Overall, no real hyperparameter search is performed to verify the optimality of the values picked in the original paper or to measure the sensitivity of the results to these variables. Both the report and the original paper only try Adam as the optimizer. The learning rates in the original paper are 0.0001 and 0.0002; the report, however, only uses a fixed learning rate of 0.0002 and does not explore other values for this hyperparameter. The Adam beta parameters also match the ones that the original author picked in one of the two originally tested setups, and no further exploration is provided in this reproducibility report. The number of discriminator updates n_D is likewise not varied from the value used in the original paper.
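For concreteness, the fixed setup amounts to something like the sketch below. The learning rate of 2e-4 is the value discussed above; the beta values and n_D are assumed typical GAN defaults, not values quoted from the report or the original paper.

```python
import tensorflow as tf

# Fixed training configuration (sketch). beta_1/beta_2 and n_D are assumed
# common defaults, not confirmed from the report itself.
g_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
d_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
n_D = 1  # discriminator updates per generator update (not varied in the report)
```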

Ablation Study:

The author tries three types of normalization (no normalization, batch norm, spectral norm) to augment the original author's choice of batch norm in the generator and spectral norm in the discriminator for the CIFAR-10 experiments. These three normalization choices yield wildly different results. No other ablation studies are present.
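As an illustration of the ablation axis being described, a discriminator block might switch between the three normalization options along these lines. This is a sketch assuming the tensorflow_addons SpectralNormalization wrapper and hypothetical helper names, not the report's actual code.

```python
import tensorflow as tf
import tensorflow_addons as tfa

def conv_block(filters, norm="spectral"):
    """3x3 strided conv block whose normalization is the ablation variable."""
    conv = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same")
    layers = []
    if norm == "spectral":
        layers.append(tfa.layers.SpectralNormalization(conv))    # constrain the conv's spectral norm
    else:
        layers.append(conv)
        if norm == "batch":
            layers.append(tf.keras.layers.BatchNormalization())  # norm == "none" adds nothing
    layers.append(tf.keras.layers.LeakyReLU(0.2))
    return tf.keras.Sequential(layers)
```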

Discussion on results:

The discussion of the results and their implications can be found in Section 2.5. It is more of an analysis of the results obtained in this round of experiments than a detailed discussion of the state of reproducibility of the original paper.

Recommendations for reproducibility:

The report makes no suggestions about how the original author could have improved their paper or code release to ease reproducibility efforts.

Overall organization and clarity:

This report would strongly benefit from further proofreading and grammar review by a native English speaker.

Confidence: 4

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4

Reviewer 2 comment:

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 7

Reviewer 3 comment: Overall I think the author did a pretty good job of reproducing the original paper's results, and surfaced some findings that disagree with the paper's.

Problem statement: Stated reasonably well.

Code: Reproduced from scratch (the original was in PyTorch; this one is in TensorFlow). It seems to be part of an existing framework, so the author didn't have to re-implement a lot of boilerplate code.

Communication with original authors: There was communication with the authors on the OpenReview site (https://openreview.net/forum?id=S1erHoR5t7). The authors did not seem to raise any issues with the re-implementation, but did express surprise at the poorer results of RaGAN.

Hyperparameter search: The reproducibility author did not do a hyperparameter search. It appears they used the same hyperparameter settings as the author of the original paper.

Ablation study: No ablation study.

Discussion on results: Some, but it is more a summary of findings, with little discussion of why the results seem to disagree with the findings in the original paper.

Recommendations for reproducibility: None.

Overall organization and clarity: Relatively clear, although the grammar needs to be improved throughout. Table 1 needs to be clarified; it is not clear what it is conveying.

Confidence: 4