Submit report for S1erHoR5t7 (#10)

LoSealL commented 5 years ago

The reproducibility report for RGAN (#10)

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4 Reviewer 1 comment : This report tests some generalization properties -- rather than the sole reproducibility of the exact results provided in the original paper -- by applying the model to CIFAR-10 (as in the original work) and to a different dataset not used in the original publication. It is unclear whether this is done because the author could not get access to the CAT dataset, or for any other reason (not stated). The report and the original paper also diverge with minimal overlap in the types of network architectures they implement. Furthermore, there might be some slight inconsistencies with the hyperparameters selected for training. Therefore, while this offers further insight into the strengths and possible failure modes of the proposed solution, it hardly provides a general validation (or invalidation) of the results reported in the original paper. Any discrepancy observed between the results in the original paper and those in this report can be attributed to several confounding factors introduced by the new setup. In general, this report doesn't quite make any strong statement nor does it shed light on the reproducibility status of the original paper/code. There is some degree of qualitative agreement in certain experiments, along with other contrasting observations about the relative success of the R- and Ra- GAN formulations.

In detail: the author describes some of the image pre-processing and data loading steps they follow, though without enough detail (random seeds, etc.) for it to be fully reproducible. For context, the original paper clearly stated the seed value they used, for example. Furthermore, the author of the reproducibility report does not comment on whether these pre-processing procedures match the ones used in the original publication and whether they might introduce some systematic discrepancies.

This report clearly states the layer composition of the architectures tested in this study (see Table 1), though it is left to the reader to understand exactly whether these match any of the ones in the original paper. As far as I can tell, what the author of the report calls DCGAN 32x32 is equivalent to what the original paper calls Standard CNN. The DCGAN 64x64, DCGAN 128x128, and DCGAN 256x256 used in the original paper are not re-implemented. This is understandable and acceptable given limited computing resources. On the other hand, the author builds and tests the performance of a Resnet architecture instead. This is helpful to further explore the generality of the method proposed in the original work.

On top of the metric used to report results in the original paper, the author of the report also provides results under a different metric, thus providing more information for analyzers to digest the results. The original paper, however, mentions that their choice of using one metric over the other is dictated by the former having a stronger correlation with image quality than the latter. This issue is not addressed in the reproducibility result; therefore, any discrepancy in the interpretation of the results under different metrics might be due to the metric being inappropriate for the task, and may not actually provide any further useful insight on the performance of the method proposed in the original paper.

Although the author of the reproducibility report enumerates the various hyperparameters used for training (rate of discriminator to generator updates, optimizer parameters, etc.), it never states whether this matches what was done in the original paper, making it hard to cross-check what stays constant and what doesn't between the two experimental setups. Since the meaning and interpretation of the results varies depending on whether the hyperparameters tested in this work are identical to the ones in the original work, it would be key for the author to state the differences and similarities more clearly.

Overall, this report highlights the fact that the choice of normalization can significantly affect the convergence properties (and therefore the results) of the trainings presented in the original paper. It also points to the fact that it is unclear whether the R-formulation can be expected to be successful for training GANs with any architecture choice, as shown in the experiments that compare DCGAN to Resnet. One should note though that the contradictory results may arise for a variety of different reasons involving the coupling between the dataset and architecture choices, their mutual suitability, and more, and may not be indicative of the effectiveness of the method proposed in the original paper. More in-depth discussion of the suspected causes of variation could be useful to the reader.

Problem statement:

The report shows sufficient understanding of the problem addressed by the solution proposed in the original paper and concisely, yet not masterfully, summarizes it in the introduction paragraph.

Code:

The author reimplements the original code in TensorFlow and shares it in a new repository, with instructions on how to run it. This new implementation, however, lives within a larger repo of reimplemented models for video and image super-resolution, and heavily depends on multiple classes and functions implemented within this larger framework, thus introducing the meta-problem of having to validate and verify another not necessarily straightforward implementation and source code. This makes it hard to debug and I cannot attest this implementation is bug-free. On the positive side, the code is rather clean, neat, easy to follow and read. Having an implementation in TensorFlow, on top of the original PyTorch implementation, can be beneficial for the community. The author does not comment on the difficulty of reimplementing the code, nor on whether they spotted any issue with the original code during this process. The report does not state whether they reimplemented the code by looking at the original code or directly from the details in the paper.

Communication with the original authors:

The author communicated their findings to the original author on Open Review, but no in-depth analysis of the agreements/disagreements ensued. The author did not use Open Review to attempt to clarify any details of the original publication that may not be documented in the original publication and may be crucial for reproducibility.

Hyperparameter Search:

Overall, no real hyperparameter search is performed to verify the optimality of those picked in the original paper and to measure the sensitivity of the significance of the results on these variables. Both the report and the original paper only try out Adam as optimizer. The learning rates in the original paper are 0.0001 and 0.0002. The report, however, only uses a fixed learning rate of 0.0002 and does not explore other values for this hyperparameter. The Adam beta parameters also match the ones that the original author had picked one of the two originally tested setups, and no further exploration is provided in this reproducibility report. The number of discriminator updates n_D is also not varied compared to the one used in the original paper.

Ablation Study:

The author tries out three different types of normalization (no normalization, batch norm, spectral norm), to augment the original author's choice of using batch norm in the generator and spectral norm in the discriminator for the CIFAR-10 experiments. These three normalization choices can return wildly different results. No other ablation studies are present.

Discussion on results:

The discussion of the results and their implication can be found in section 2.5. This is more of an analysis of the results obtained in this round of experiments rather than a detailed discussion on the state of reproducibility of the original paper.

Recommendations for reproducibility:

The report makes no suggestion on how the original author could have improved their paper or code release to ease reproducibility efforts.

Overall organization and clarity:

This report would strongly benefit from further proofreading and grammar review by a native English speaker. Confidence : 4

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 4 Reviewer 2 comment :

Problem statement The problem of stabilizing GANs and enhancing the quality of the generated images is clearly stated in the report, and briefly discussed by the authors.
Code The authors mention reimplementing the Pytorch code provided by the original authors using Tensorflow and making it available on GitHub. However, what they provided is a link to a huge repo where they have implemented 30+ deep models. It is not clear which code belongs to this experiment, and a quick search for GAN keyword does not help. Despite the fact that they provide instructions on how to run their model, I find it this method of sharing the code useless and unhelpful. It is necessary to provide the code for this paper separately.
Communication with original authors The report does not mention communication with the authors, but the flag for “Openreview-comment” is provided. The authors have shared their results on open review with a link to their code, and the original authors showed interest in looking at it.
Hyperparameter Search The authors only replicated the network architecture in the original paper (a ConvNet), and proposed using a resNet architecture to compare it. They also proposed using two normalization techniques which the original paper didn’t use. Namely, BatchNorm, and Spectral Norm. There is no mention of hyperparameter tuning, such as the number of layers, or filters in the convNet.
Ablation Study The authors implicitly study the effect of normalization on training GANs. Specifically, in Table~3, the authors show the FID results with and without Spectral Normalization (SN). I find the results very important and interesting. However, there is no meaningful discussion for the results. Particularly, the results show that standard non-saturating GANs (NSGANs) have the best FID performance compared to the other methods when SN is used. Additionally, NSGAN-GP’s FID drops from 13.00 (being the best variation) to 290.1, in the second to last position. None of these results were discussed or briefly noted.
Discussion on results The discussion is minimal, and rarely goes beyond describing the results. In one case, the authors state that RaGAN improves when normalization is used, however, this contradicts their results in Table~3. Finally, the authors only make a recommendation of using normalization to increase the performance.
Recommendations for reproducibility Only two recommendations were made: using spectral norm for normalization and using the GAN with the Gradient Penalty variant. This seems to be based on the authors’ results.
Overall organization and clarity The report is ok. Clearly the authors put a lot of effort while working on it, but it still has a number of issues. Starting with the code that is not clearly available for inspection, then the results that are merely provided without any meaningful discussion, and finally, the general presentation of the results. In details, I believe that Table~2 that contain the results should be divided to two tables. I find it very problematic to compare the results of two metrics where a higher value for one indicates a better performance (the inception score), while a higher value for the other indicates a lower performance (the FID). Table~3 as well could be better visualized with a bar graph since there are not two many results, (6 models, two settings each). Finally, the authors provide some qualitative results (generated images) in the appendix. However, neither the quality was discussed, nor the appendices and their contents were mentioned in the report.

Confidence : 5

reproducibility-org commented 5 years ago

Hi, please find below a review submitted by one of the reviewers:

Score: 7 Reviewer 3 comment : Overall I think the author did a pretty good job at reproducing the original paper's results, and resulted in some findings that disagree with the paper's findings.

Problem statement Stated reasonably well.

Code Reproduced from scratch (original was in pytorch, this one was in tensorflow). Seems to be part of an existing framework, so authors didn't have to re-implement a lot of the boilerplate code.

Communication with original authors There was communication with the authors on the OpenReview site: https://openreview.net/forum?id=S1erHoR5t7 The authors did not seem to raise any issue with the re-implementation, but did express surprise at the poorer results of RaGAN.

Hyperparameter search The reproducibility author did not do a hyperparameter search. It appears they used the same hyperparameter settings as the author of the original paper.

Ablation study No ablation study.

Discussion on results Some, but more of a summary of findings with little discussion on why their results seem to disagree with the findings in the original paper.

Recommendations for reproducibility No.

Overall organization and clarity Relatively clear, although the grammar needs to be improved throughout. Table 1 needs to be clarified, it's not clear what it's conveying. Confidence : 4

reproducibility-challenge / iclr_2019