This is what the authors of [1] meant by "interpolating property" (Assumption 2 page 3): https://github.com/google/jaxopt/blob/main/docs/stochastic.rst#adaptive-solvers
[1] Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G. and Lacoste-Julien, S., 2019. Painless stochastic gradient: Interpolation, line-search, and convergence rates. Advances in neural information processing systems, 32.
This is already specified in the JAXopt docs:

> For convergence guarantees to hold, these two algorithms require the interpolation hypothesis to hold: the global optimum over D must also be a global optimum for any finite sample of D. This is typically achieved by overparametrized models (e.g. neural networks) in classification tasks with separable classes, or on regression tasks without noise.
So it is a bit more general than "overparametrized" (whose exact meaning is not very clear to me). The function class does not need to be extremely expressive, as long as the task is simple enough that perfectly fitting the training set is possible and coincides with a global optimum (over the test set).
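For concreteness, here is a minimal sketch of the two regimes (plain JAX, made-up data): with noiseless labels, the full-data optimum also minimizes the loss on any subsample, so the interpolation hypothesis holds; with label noise, minibatch optima disagree with the full-data one and the hypothesis fails.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (100, 5))
w_true = jnp.arange(5.0)

def loss(w, X, y):
  return jnp.mean((X @ w - y) ** 2)

# Noiseless labels: the global optimum w_true has zero loss on the full
# dataset and on any subsample -> interpolation hypothesis holds.
y_clean = X @ w_true
print(loss(w_true, X, y_clean))            # 0.0
print(loss(w_true, X[:10], y_clean[:10]))  # 0.0 as well

# Noisy labels: the full-data least-squares solution no longer minimizes
# the loss of each subsample -> interpolation hypothesis fails.
y_noisy = y_clean + 0.1 * jax.random.normal(key, (100,))
w_full, *_ = jnp.linalg.lstsq(X, y_noisy)
w_batch, *_ = jnp.linalg.lstsq(X[:10], y_noisy[:10])
print(jnp.linalg.norm(w_full - w_batch))   # > 0: minibatch optima differ
```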
Should I add more detail, reusing your formulation?
I think that text is fine, but it should also be added to the API doc, i.e. https://jaxopt.github.io/stable/_autosummary/jaxopt.PolyakSGD.html#jaxopt-polyaksgd
Many people will just look up the solver's API instead of reading the narrative documentation.
These methods only converge to full precision on over-parameterized problems.
Not specifying this gives the wrong impression that they should always fully converge, which is not true. That can be highly confusing or, worse, give the impression that the implementation is broken.
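For what it's worth, here is a small sketch of the behavior this note is about, assuming `jaxopt.PolyakSGD`'s standard `init_state`/`update` API (synthetic data and batching scheme made up for illustration): on the noiseless problem the solver drives the loss to roughly zero, whereas with label noise it plateaus at a noise-dependent level.

```python
import jax
import jax.numpy as jnp
import jaxopt

def loss(params, X, y):
  return jnp.mean((X @ params - y) ** 2)

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (256, 10))
w_true = jnp.ones(10)
y = X @ w_true                                    # noiseless: interpolation holds
# y = y + 0.1 * jax.random.normal(key, (256,))   # with noise: interpolation broken

solver = jaxopt.PolyakSGD(fun=loss, max_stepsize=1.0)
params = jnp.zeros(10)
state = solver.init_state(params, X[:32], y[:32])

# Plain cyclic minibatch loop using the solver's update rule.
for step in range(500):
  start = (step * 32) % 256
  batch_X, batch_y = X[start:start + 32], y[start:start + 32]
  params, state = solver.update(params, state, batch_X, batch_y)

print(loss(params, X, y))  # ~0 in the noiseless case; stalls above 0 with label noise
```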