tomgoldstein / loss-landscape

Code for visualizing the loss landscape of neural nets
MIT License
2.72k stars 388 forks

Two problems about paper #24

Open seanM29 opened 5 years ago

seanM29 commented 5 years ago
Jamesswiz commented 5 years ago

I have the same query after reading the paper.

Can the authors please comment?

liiliiliil commented 3 years ago

I also don't understand the first question. : (

For the second one, I think the key is to show that a convex-looking region in the projected surface is also (approximately) convex in the original surface. A small absolute ratio |λ_min / λ_max| means the maximum eigenvalue is large compared to the minimum eigenvalue, which may be negative; in other words, the positive curvature is dominant. So a convex-looking region in the projected surface that also has a small absolute eigenvalue ratio corresponds to a nearly convex region in the original surface.
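As a toy illustration of that eigenvalue-ratio argument (not code from the paper — the Hessian here is a made-up diagonal matrix chosen just to show the |λ_min / λ_max| computation):

```python
import numpy as np

# Hypothetical Hessian: mostly positive curvature plus one slightly
# negative eigenvalue, i.e. a "nearly convex" point on the loss surface.
H = np.diag([5.0, 2.0, 0.5, -0.01])

eigvals = np.linalg.eigvalsh(H)          # ascending order for symmetric H
lam_min, lam_max = eigvals[0], eigvals[-1]
ratio = abs(lam_min / lam_max)

print(f"lambda_min={lam_min}, lambda_max={lam_max}, |ratio|={ratio:.4f}")
# The ratio is close to 0, so the positive curvature dominates and the
# region is close to convex even though lambda_min is slightly negative.
```

A ratio near 1 would instead mean the negative direction is as strong as the positive one, so the apparent convexity of the 2-D projection could be an artifact.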

knowlen commented 3 years ago

Not affiliated with the paper, but in non-convex optimization it is generally believed that wide minima generalize better than sharp minima. This clip from Leo Dirac (start at 16:30) conveys the intuition. The paper's results capture this empirically.
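A minimal sketch of that intuition (purely illustrative 1-D losses, not from the paper): if the test loss is the training loss shifted slightly in weight space, the same shift costs far more at a sharp minimum than at a wide one.

```python
# Two toy 1-D training losses, both minimized at w = 0.
# The curvature values are hypothetical, chosen only for illustration.
def sharp(w):
    return 50.0 * w ** 2   # high curvature -> sharp minimum

def wide(w):
    return 0.5 * w ** 2    # low curvature -> wide minimum

# Model the train/test mismatch as a small perturbation of the weights.
eps = 0.1
print("sharp minimum, loss after shift:", sharp(eps))
print("wide  minimum, loss after shift:", wide(eps))
# The sharp minimum's loss grows 100x more under the same perturbation,
# which is the usual argument for why wide minima should generalize better.
```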