thomas-tanay / post--L2-regularization

Distill submission

Review Report 3 - Anonymous Reviewer C #13

colah opened 6 years ago

colah commented 6 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to give such a thorough review of this article. Thoughtful and invested reviewers are essential to the success of the Distill project.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


High level:

I guess if this were a regular journal I would probably recommend "reject" or "resubmit with major revision" after deciding whether it's a pedagogy piece or a weight decay advocacy paper.

General suggestions:

Just wrong: "In practice, using an appropriate level of regularization helps avoid overfitting and constitutes a simple form of adversarial training." -> weight decay isn't the same as adversarial training. The phrase "adversarial training" originates in "Explaining and Harnessing Adversarial Examples" by Goodfellow et al. Section 5 of that paper compares adversarial training to weight decay and shows how they are different things.
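A minimal numpy sketch (a toy setup I'm assuming for illustration, not code from the article) of why the two are different: for a linear model, weight decay adds a norm penalty outside the loss, while adversarial training evaluates the loss on worst-case perturbed inputs, which for the L_inf ball behaves like an L1 penalty inside the loss (this is the comparison in Section 5 of Goodfellow et al.).

```python
import numpy as np

# Toy linear classifier f(x) = w.x + b with logistic loss on labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100) * 2 - 1
w, b = rng.normal(size=5), 0.0

def logistic_loss(w, b, X, y):
    return np.mean(np.log1p(np.exp(-y * (X @ w + b))))

lam, eps = 0.01, 0.1

# Weight decay: penalize ||w||^2, independent of the data distribution.
loss_weight_decay = logistic_loss(w, b, X, y) + lam * np.sum(w ** 2)

# Adversarial training (FGSM, exact for a linear model): evaluate the loss
# on the worst-case L_inf perturbation x - eps * y * sign(w), which moves
# each point toward the decision boundary.
X_adv = X - eps * y[:, None] * np.sign(w)
loss_adversarial = logistic_loss(w, b, X_adv, y)
```

The two objectives coincide in neither form nor effect: the decay term is constant across inputs, while the adversarial term depends on where the data sits relative to the boundary.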

Medium size problem:

Nits: "In linear models and small neural nets, L2 regularization can be understood as a balancing mechanism between two objectives: minimizing the training error err_train and maximizing the average distance between the data and the boundary d_adv." -> maybe unpack this into a few more sentences. When I first read it, I thought the boundary was called d_adv. It took me a few reads to get that d_adv was the distance.
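To make the ambiguity concrete, here is a small sketch (my own illustrative code, assuming the article's linear-model setting) of the quantity the sentence is defining: for a linear classifier f(x) = w.x + b, the distance from a point x to the decision boundary {x : f(x) = 0} is |w.x + b| / ||w||, and d_adv is the average of that distance over the data.

```python
import numpy as np

def d_adv(w, b, X):
    """Average distance from the rows of X to the hyperplane w.x + b = 0."""
    return np.mean(np.abs(X @ w + b) / np.linalg.norm(w))

w = np.array([3.0, 4.0])   # ||w|| = 5
b = 0.0
X = np.array([[5.0, 0.0],  # w.x + b = 15 -> distance 3
              [0.0, 5.0]]) # w.x + b = 20 -> distance 4
# d_adv(w, b, X) -> (3 + 4) / 2 = 3.5
```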

Typos: " strongly titled." -> "strongly tilted"

One more thought: they really should use SVHN instead of MNIST. MNIST is basically solved for norm-constrained adversarial examples now. There's a trivial solution where you just threshold each pixel at 0.5 into a binary value, and the latest training algorithms are able to discover this solution. SVHN is not a lot more computationally demanding than MNIST but doesn't have this trivial solution.
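The trivial solution is easy to see in code. This is my own sketch of the thresholding idea, not anything from the article: because MNIST pixels are nearly binary, thresholding each pixel at 0.5 erases any L_inf perturbation of size eps < 0.5 applied to a 0/1-valued image.

```python
import numpy as np

def binarize(x):
    """Threshold each pixel at 0.5 into a binary value."""
    return (x >= 0.5).astype(np.float32)

x = np.array([0.0, 0.0, 1.0, 1.0])        # a clean, binary "image"
delta = np.array([0.3, -0.3, 0.3, -0.3])  # any perturbation with |delta| < 0.5
assert np.array_equal(binarize(x + delta), binarize(x))  # the attack is undone
```

SVHN images have genuinely continuous pixel intensities, so no such one-line preprocessing defense exists there.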

stared commented 6 years ago

While there are many good remarks, this claim is demonstrably incorrect:

> I'm skeptical whether this work is interesting enough for Distill. It is based on work that has been available on the web for over a year and has attracted little interest. If this was a conference reviewing system I think this paper would be rejected for low interest / low novelty at the least.

It got some interest on Reddit Machine Learning and Hacker News.