colah opened this issue 6 years ago
While there are many good remarks, this claim is demonstrably incorrect:
I'm skeptical whether this work is interesting enough for Distill. It is based on work that has been available on the web for over a year and has attracted little interest. If this was a conference reviewing system I think this paper would be rejected for low interest / low novelty at the least.
It got some interest on Reddit Machine Learning and Hacker News.
The following peer review was solicited as part of the Distill review process.
The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to give such a thorough review of this article. Thoughtful and invested reviewers are essential to the success of the Distill project.
Conflicts of Interest: Reviewer disclosed no conflicts of interest.
High level:
I guess if this were a regular journal I would probably recommend "reject" or "resubmit with major revision" after deciding whether it's a pedagogy piece or a weight decay advocacy paper.
General suggestions:
Just wrong: "In practice, using an appropriate level of regularization helps avoid overfitting and constitutes a simple form of adversarial training." -> weight decay isn't the same as adversarial training. The phrase "adversarial training" originates in "Explaining and Harnessing Adversarial Examples" by Goodfellow et al. Section 5 of that paper compares adversarial training to weight decay and shows how they are different things.
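To make the distinction concrete, here is a minimal numpy sketch (toy data and names are illustrative, not from the article) contrasting the two: weight decay adds a penalty on the weights themselves, while FGSM-style adversarial training perturbs the *inputs* in the direction that increases the loss and trains on the perturbed data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data with labels in {-1, +1}.
X = rng.normal(size=(32, 5))
y = np.sign(rng.normal(size=32))
w = rng.normal(size=5) * 0.1

def logistic_loss(w, X, y):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad(w, X, y):
    # Per-sample dloss/dz with z = X @ w, then chain rule through X.
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / len(y)

# Weight decay: penalize the weights directly, independent of the data.
lam = 0.01
grad_wd = grad(w, X, y) + lam * w

# FGSM-style adversarial training (Goodfellow et al.): for a linear model,
# dloss/dx_i = s_i * w, so sign(s_i * w) is the FGSM perturbation direction.
eps = 0.1
s = -y / (1.0 + np.exp(y * (X @ w)))
X_adv = X + eps * np.sign(s[:, None] * w[None, :])
grad_adv = grad(w, X_adv, y)
```

The two produce different training signals: `grad_wd` shrinks `w` regardless of the data, while `grad_adv` is the ordinary loss gradient evaluated at shifted data points, which for this linear model provably increases every sample's loss by reducing each margin by `eps * ||w||_1`.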
Medium size problem:
Nits: "In linear models and small neural nets, L2 regularization can be understood as a balancing mechanism between two objectives: minimizing the training error err_train and maximizing the average distance between the data and the boundary d_adv." -> maybe unpack this into a few more sentences. When I first read it, I thought the boundary was called d_adv. It took me a few reads to get that d_adv was the distance.
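For a linear model the quantity being called d_adv has a closed form, which might be worth spelling out in the article: the distance from a point x to the decision boundary of w·x + b = 0 is |w·x + b| / ||w||. A tiny sketch (the numbers are illustrative, not from the article):

```python
import numpy as np

# Distance from a point to a linear decision boundary w·x + b = 0.
w = np.array([3.0, 4.0])
b = 1.0
x = np.array([2.0, 1.0])

# |w·x + b| / ||w|| = |3*2 + 4*1 + 1| / 5
d_adv = abs(w @ x + b) / np.linalg.norm(w)
```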
Typos: " strongly titled." -> "strongly tilted"
One more thought: they really should use SVHN instead of MNIST. MNIST is basically solved for norm-constrained adversarial examples now. There's a trivial solution where you just threshold each pixel at 0.5 into a binary value, and the latest training algorithms are able to discover this solution. SVHN is not a lot more computationally demanding than MNIST but doesn't have this trivial solution.