To follow up on this, there is also the question of how to classify the methods that find a critical point of a function by simply finding a root of its gradient; an example is Newton's method. Other schemes explicitly step in the direction of smaller function values, so for those it makes sense to speak of a minimization method in the first place. So, summarizing:
`findroot`
@ozanoktem @aringh what's your take on this?
I will have a look at it... Part of this separation feels like it is highly 'personal', in the sense that different people will find it natural to look for different solvers under different categories.
We shouldn't have too many categories; the current structure is too fine imo. Categories should be mutually exclusive, for example a scalar optimizer vs. an equation system solver.
I agree that the number of categories should be kept small, in order to make it easier to find things. However, the current categories are not optimal in my opinion; what is the difference between `vector` and `scalar`? Aren't most schemes `iterative`?
Also: the difference between root finders and optimization schemes is not always so sharp. Newton's method can be seen as solving g(x) = 0, but if f(x) is a (convex) function and one lets g(x) = grad_f(x), then we are in fact solving an optimization problem with a second-order method.
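For reference, the standard identity behind this remark (textbook material, nothing ODL-specific): the Newton step for a root of g,

x_{k+1} = x_k - [g'(x_k)]^{-1} g(x_k),

turns, with g = \nabla f, into

x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),

which is exactly the second-order (Newton) step for minimizing f.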
There seems to be no really clear-cut distinction mathematically, so I guess we should choose something which makes sense from an implementation point of view. Not too many categories, I agree, rather a few "axes" along which we distinguish. Let me throw in another suggestion:
Solvers for an equation A(x) = y or A(x) = 0. This typically includes methods which compute an approximate solution to the normal equations in the linear case, or do the same for the linearized problem in each step of a nonlinear problem. This would include ...
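For concreteness (standard formulas, not something stated in the thread): in the linear case the normal equations read

A^* A x = A^* y,

and a Gauss-Newton-type method solves the same equations for the linearization in each step of a nonlinear problem,

A'(x_k)^* A'(x_k)\, \Delta x = A'(x_k)^* (y - A(x_k)), \qquad x_{k+1} = x_k + \Delta x.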
There is one fairly easy way to distinguish between root finders and optimization schemes. If the solver in question takes an optimization problem as input argument, then it is an optimization scheme. If it takes an equation as input argument, then it is a root finder. It might very well be that an optimization scheme makes use of a root finder, and vice versa, a root finder might use an optimization scheme. Now, this requires that ODL has an object structure for representing equations and/or optimization problems. If that is not feasible, then I would go for a categorization similar to what @kohr-h wrote. Within optimization methods one can also distinguish between the following: line-search methods, trust-region/surrogate methods, and hybrid methods.
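A minimal sketch of what this input-based distinction could look like; none of these classes exist in ODL, and all names (`Equation`, `MinimizationProblem`, the solver signatures) are hypothetical:

```python
class Equation(object):
    """The problem A(x) = b for an operator ``A`` and data ``b``."""
    def __init__(self, operator, rhs):
        self.operator = operator
        self.rhs = rhs


class MinimizationProblem(object):
    """The problem min_x f(x) for a functional ``f``."""
    def __init__(self, functional):
        self.functional = functional


def newton(problem, x0, maxiter=10):
    """A root finder: accepts only an ``Equation``."""
    if not isinstance(problem, Equation):
        raise TypeError('`newton` solves equations, got {!r}'.format(problem))
    # ... iterate on problem.operator and problem.rhs ...


def steepest_descent(problem, x0, maxiter=10):
    """An optimization scheme: accepts only a ``MinimizationProblem``."""
    if not isinstance(problem, MinimizationProblem):
        raise TypeError('`steepest_descent` minimizes functionals, '
                        'got {!r}'.format(problem))
    # ... iterate on problem.functional ...
```

An optimizer could still internally build an `Equation` for its optimality condition and hand it to a root finder, which is the interplay described above.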
Thanks for the input, @ozanoktem. Clearly the situation is not as clear-cut as my separation suggests.
Still, I think we have to make the decision based on practicality, meaning that the separation should make sense even for somebody who is not an expert on optimization. It's just a matter of being able to locate a method in reasonable time plus some mathematical backing so it's not completely arbitrary.
Regarding representation of problems, I think that would be a major undertaking. You would have to build some sort of domain specific language to handle independent and dependent variables, constraints, optimality conditions etc. so in a way, we would have to write a clone of PyOpt. Maybe at some point we need a more structured way than we have in mind now (just directly defining a functional and giving it some useful properties), but that would be over the top IMO.
So in that way, I think your suggestions can help to differentiate on a file level and to find good names.
I agree with @kohr-h. In some sense it would be nice to be able to just send a problem instance to a solver and then get an answer back, however in my opinion this project is not about building an entire optimization toolbox for all-purpose use. Then we could just pipe all the computations to, for example, IPOPT, SNOPT, NLopt, or any other library.
Regarding the split into line-search, trust-region/surrogate and hybrid: I'm not convinced that this is the best split. In some sense, this is just algorithmic differences in how the problem is solved. I liked the idea of `Ax = b`.
As @kohr-h wrote, in order to use the last class of solvers, one really needs to do some homework with paper and pen first.
Are we happy with the current structure of the `solvers` package? If so, we can close this. Else we should fix it rather asap since it could be breaking.
> Are we happy with the current structure of the `solvers` package?

No.

> Else we should fix it rather asap since it could be breaking.

Yes, not so much for the "top" methods, but definitely for the basic ones which are used by other code parts.
So we're back to the old question above of how to classify the methods we have (and what is likely to be implemented soon). Here's a new suggestion: classify according to the problem being solved. What do we have so far?
Mainly Newton's method, but in fact also all quasi-Newton methods. I think we make an artificial distinction here between plain Newton and quasi-Newton methods like BFGS. The former takes an operator (= function) f and iteratively determines x such that f(x) = 0. The latter take an operator (= gradient of a function) grad_f and iteratively determine x such that grad_f(x) = 0. But as far as I can see, we never use the fact that the provided operator is a function gradient.
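To make that point concrete, here is a toy sketch in plain NumPy (not ODL code, all names made up): the very same iteration is a root finder when given an arbitrary f, and a second-order minimizer when f happens to be a gradient and `jac` its Hessian.

```python
import numpy as np


def newton_root(f, jac, x, niter=10):
    """Plain Newton iteration for f(x) = 0 (toy sketch, not ODL code).

    ``f`` maps R^n -> R^n and ``jac(x)`` is its Jacobian.  If ``f`` is the
    gradient of a functional and ``jac`` its Hessian, the same loop is a
    second-order minimization method -- the code never notices.
    """
    for _ in range(niter):
        x = x - np.linalg.solve(jac(x), f(x))
    return x


# Toy usage: componentwise root of f(x) = x**2 - 2, i.e. x -> sqrt(2).
f = lambda x: x ** 2 - 2.0
jac = lambda x: np.diag(2.0 * x)
print(newton_root(f, jac, np.array([1.0, 3.0])))
```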
So, in summary, here I would put the following methods:
Name suggestions: `zero`, `root`, `find...`
Many methods do this upon convergence, even if one does not want to go all the way to convergence. Note that the least-squares interpretation usually only holds for a linear operator; in the nonlinear case I don't know exactly what happens, but I would consider that a detail. So this category would encompass ...
Name suggestions: `least_squares`, `lsq`
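As an illustration of the kind of method meant here, a toy Landweber iteration in plain NumPy (not ODL code; ODL's solvers work on operators rather than matrices): for a linear A and a small enough step omega it converges to a least-squares solution of Ax = y.

```python
import numpy as np


def landweber(A, y, x, omega, niter=500):
    """Toy Landweber iteration x <- x + omega * A^T (y - A x)."""
    for _ in range(niter):
        x = x + omega * A.T.dot(y - A.dot(x))
    return x


# Overdetermined toy problem: the iterate approaches the least-squares solution.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])
x = landweber(A, y, np.zeros(2), omega=0.2)
print(x, np.linalg.lstsq(A, y, rcond=None)[0])
```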
Having a category of its own for this is probably a bit over the top, but it doesn't fit into the others. I guess there are more methods (variants) that could be added here, too.
Name suggestions: `gradient`, `descent`, `grad_descent`
This is more or less the current `advanced` stuff, including sub-packages. I think they are grouped well as they are, only the name is not great.
Name suggestion: `convex`
I think the current structure of the `solvers` package is far from satisfactory. Before deciding upon this, one must first decide why we need a structure at all. For whom is it? In what sense is a structured solver package going to make life easier? Now, one of the proposed papers that I mentioned in my e-mail to the group on 2016-08-01 was a description of the ProxImaL-ODL integration. Natural authors associated with that paper would be @adler-j, @aringh and relevant people from the Boyd group in Stanford, like Steve Diamond. The issue of imposing a structure on the `solvers` package, and if so, how it is to be designed, is a very natural part of that paper. I therefore suggest we make a quick and dirty structure along the lines of @kohr-h's latest suggestion and postpone any more ambitious work to the aforementioned paper.
@kohr-h, very nice writeup and I agree with your approach. My biggest worry with the naming is `least_squares` or similar, since some of the solvers that we would want to place there would perhaps only work for fully determined systems, and hence not be proper least-squares solvers. Basically what we want is solvers for A(x) = b, where A could possibly be non-linear. I'm not sure what such solvers should be called, perhaps `equation_solvers`, `equation` or something?
@ozanoktem, I think you are overstating the significance of this. In daily life, most users would simply access all the solvers by `odl.solvers.some_solver` and not `odl.solvers.some_package.some_solver`, hence restructuring the package internally wouldn't make a difference. This is mostly a programming issue in that we want to have a nice structure of where code should be put.
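A minimal sketch of how that flat access could be preserved regardless of the internal layout; the subpackage names `smooth`, `nonsmooth` and `equation` are placeholders, not actual ODL packages:

```python
# Hypothetical odl/solvers/__init__.py: re-export everything from the
# subpackages so that `odl.solvers.some_solver` keeps working no matter
# how the package is organized internally.
from .smooth import *
from .nonsmooth import *
from .equation import *

from . import smooth, nonsmooth, equation

__all__ = smooth.__all__ + nonsmooth.__all__ + equation.__all__
```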
> @kohr-h, very nice writeup and I agree with your approach. My biggest worry with the naming is `least_squares` or similar, since some of the solvers that we would want to place there would perhaps only work for fully determined systems, and hence not be proper least-squares solvers. Basically what we want is solvers for A(x) = b, where A could possibly be non-linear. I'm not sure what such solvers should be called, perhaps `equation_solvers`, `equation` or something?
Hm, I see your point. Do you have an example of a method that would be affected? As far as I understand, the current methods all produce a least squares solution upon convergence. The situation you describe would mean that a method would "misbehave" within the least squares category in the sense that it crashes or produces wrong answers for non-invertible cases. One name that came to my mind was "approximate inverse", which is somewhat a relaxed pseudo-inverse and would be slightly broader, perhaps encompassing the case you mention. Unfortunately, the name is taken by a concrete family of methods, which we also may want to implement at some point.
Would we be too strong if we simply named it `inverse`?
What about `ml_solvers`? Least-squares solutions are in fact a special case of maximum likelihood solutions when the data likelihood is given by a Gaussian distribution. This would then cover all the ART, MART, SART, SIRT, ... solvers, as well as the ML-EM solvers.
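For reference, the standard argument behind this (textbook material, not specific to ODL): for a Gaussian data likelihood with fixed variance \sigma^2,

\max_x \prod_i \exp\Bigl(-\frac{(y_i - A(x)_i)^2}{2\sigma^2}\Bigr) \;\Longleftrightarrow\; \min_x \sum_i \bigl(y_i - A(x)_i\bigr)^2,

so maximizing the likelihood is exactly the least-squares problem.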
> What about `ml_solvers`? Least-squares solutions are in fact a special case of maximum likelihood solutions when the data likelihood is given by a Gaussian distribution. This would then cover all the ART, MART, SART, SIRT, ... solvers, as well as the ML-EM solvers.
While this is true, I'm not sure that it properly captures the application and the similarities in implementation/interface between the solvers. To me, the two (ML vs. least squares) imply different ways of thinking.
> What about `ml_solvers`? Least-squares solutions are in fact a special case of maximum likelihood solutions when the data likelihood is given by a Gaussian distribution. This would then cover all the ART, MART, SART, SIRT, ... solvers, as well as the ML-EM solvers.
That would be a generalization, and the idea is not bad in principle. But the problem with the solvers with stricter assumptions would not be resolved by this.
> Would we be too strong if we simply named it `inverse`?
That's a bit too strong in my view. On the other hand, we're solving an operator equation, so something using that information would also be possible, e.g. `op_eq` to be very short, or `operator_eq`.
> While this is true, I'm not sure that it properly captures the application and the similarities in implementation/interface between the solvers. To me, the two (ML vs. least squares) imply different ways of thinking.
I'm not sure what @adler-j means by "different ways of thinking". On the other hand, I agree that the naming `ml_solvers` does not capture any algorithmic similarities. If that is the goal, then one might perhaps use `fixed_point_iterators`, which is an algorithmic stratification.
> On the other hand, I agree that the naming `ml_solvers` does not capture any algorithmic similarities. If that is the goal,
I don't think that's necessarily the goal. I deliberately categorized the methods according to what they solve, not how.
It is true that nothing has been done here for a while. My idea was to try to take care of this when the pull request #498 (functionals) is merged, in order to not create too many conflicts with that branch (this might have been a good or a bad move; in any case it was my way of going about it).
During the "workshop" earlier this spring, we came up with a suggestion that was
Reviving this discussion, which has been delayed until after the merge of `Functional`.
According to my proposal we could distinguish between (names are suggestions):

- Zero/root finders (`zero(s)`, `root(s)`, `findzero`, `findroot`, `root_finders`, ...)
- Equation solvers (`eq_solvers`, `equation_solvers`)
- Smooth optimization methods (`smooth`, `gradient`)
- Nonsmooth optimization methods (`nonsmooth`, `convex`)

plus categories we don't have currently, like heuristic gradient-free methods à la Nelder-Mead or simulated annealing, or ordered subset methods.
Regarding `convex` vs. `nonsmooth`: Depends on where we want to put nonsmooth non-convex methods. If we want to put them together with the ones we have now, then `nonsmooth` is the better name, otherwise `convex` will be a better fit.
What would separate a zero finder from a gradient method?
> What would separate a zero finder from a gradient method?
How would you classify Newton-like methods as gradient methods? That doesn't make sense.
Well then which ones are the gradient methods? Steepest descent, any others?
Conjugate gradient, for example? Although it would also classify as an equation solver... Sigh... Perhaps it really only makes sense to distinguish between smooth and nonsmooth on the top level and do the fine-grained distinction on the module level.
We really don't need 18 folders, what does it add? I'd be fine with smooth and nonsmooth for now.
Correction, we have three types: min_x f(x), ...
Okay, I'll restructure this part so it's done once and for all. First though, some of the "deeper" PRs like #587 need to go in.
I'll get this done as part of #587; I've already run into issues with this as part of that.
Alright then, but don't forget to put the moves on separate commits! :-)
Moved from https://github.com/odlgroup/odl-solvers/issues/3
I initialized a structure, but it is obviously far from optimal. We need to find sensible categories for solvers, e.g.
We need to reflect these categories in classes in a sensible manner.