neurospin / pylearn-parsimony_history

Sparse and Structured Machine Learning in Python
BSD 3-Clause "New" or "Revised" License

Mismatch between the code and mathematical properties of the functions #27

Open duboism opened 10 years ago

duboism commented 10 years ago

Fouad and I have run into some difficulties implementing new estimators with parsimony. It's not impossible (so we can still aim at releasing 1.0), but some parts are a bit hard to grasp and there is some mismatch between the mathematics and the implementation.

For instance with ExcessiveGapMethod the function must implement the StronglyConvex interface and therefore the parameter method (which returns the strong convexity constant). The problem is that from a mathematical point of view only the smooth part is strongly convex. Therefore, most implementations (for instance RR_SmoothedL1TV in functions.combinedfunctions) simply return g.parameter() (where g is the smooth part). Similar issues may happen with the gradient or other properties.
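Schematically, the delegation looks something like this (an illustrative sketch only, not the actual parsimony code; the class name and attributes are made up):

```python
# Illustrative sketch only -- not the actual parsimony implementation.
class RRSmoothedL1TVLike(object):
    """Stand-in for a combined function f(beta) = g(beta) + h(beta), where g is
    the smooth ridge-regression part and h is the smoothed L1 + TV part."""

    def __init__(self, g, h):
        self.g = g  # smooth, strongly convex part
        self.h = h  # smoothed non-smooth part

    def parameter(self):
        # The StronglyConvex interface asks the *whole* function for its strong
        # convexity constant, but in practice the call is simply forwarded to g.
        return self.g.parameter()
```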

After discussion with Fouad and Édouard, we think it's a bit confusing. The problem is linked to the fact that algorithms expect only one function, which must then encapsulate all the information. Therefore we wanted to discuss this (it could cause a lot of changes, so we should discuss it carefully). We noticed that:

What do you think of that?

I have some ideas to improve the situation that I can describe in comments but we should probably first agree on the issue.

tomlof commented 10 years ago

Could you specify why this is a problem for you in your current work, when you want to add another estimator?

The possible issue with this design that I see is: what happens if some algorithm in the future requires, e.g., gradients from two parts? If f = g + h, and the algorithm needs to access both g.grad and h.grad, how should we access those from within the algorithm?

Note that this problem is reduced somewhat if you use the CombinedFunction, since it separates smooth functions and penalties from non-smooth (proximal operators) and constraints (through projections [though, not yet working properly]).

Currently, the idea is to let the (combined) function make the separation between smooth and non-smooth parts (or whatever parts there are), the algorithm should just ask for whatever it needs and the functions provide the "correct" result.

I think the reason for this was to have a unified API for the algorithms as algorithm(function, start_vector). In early versions of the library, some algorithms (e.g. ISTA and FISTA) had the signature algorithm(smooth_function, non-smooth_function, start_vector), but there were problems with that design in terms of polymorphism.
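To illustrate the two designs, the signatures discussed were roughly as follows (schematic only, not the exact parsimony API):

```python
# Schematic signatures only -- not the exact parsimony API.

# Early design: the algorithm received the parts of f = g + h separately.
def fista_run_early(smooth_function, nonsmooth_function, start_vector):
    """Would call smooth_function.grad(...) and nonsmooth_function.prox(...)."""
    ...

# Current design: a single (combined) function object exposes everything the
# algorithm asks for (f, grad, prox, step, ...).
def fista_run_current(function, start_vector):
    """Would call function.grad(...) and function.prox(...)."""
    ...
```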

Aaah, I just remembered: the reason this did not work was that we needed to combine properties of two functions. The algorithms became too complex, and unnecessary "cleverness" had to be introduced inside all algorithms. The combined functions (e.g. the combination of LR + L1 + L2 + TV in combinedfunctions.py) were the solution to that problem.

Thus, we have already discussed this and concluded that we needed to create the combined functions ;-) However, it would be great if we could iterate this discussion again, maybe we can think of a better solution!

vguillemot commented 10 years ago

I think I remember a discussion about this difference between the Excessive Gap Method and CONESTA/FISTA/ISTA: it has something to do with the fact that both the A matrix and the dual vector for the Excessive Gap Method combine the A matrices and dual vectors of the non-smooth penalties (e.g. L1 + TV), which is not at all straightforward.

It is maybe explained more clearly in sections 3.1 and 3.2 of the CONESTA article: you can see in those sections that it is going to be tricky to include the "lambdas" in a flexible way.

I hope this helps and that I understood correctly.

duboism commented 10 years ago

@tomlof Simply put, the problem is that some functions must implement interfaces which mathematically don't make sense for them. For instance, RR_SmoothedL1TV is declared StronglyConvex while it's not (only the ridge part is). Of course the implementation is correct, but it's non-intuitive. We have to do the same kind of thing in our case.

As you said, the problem is related to polymorphism, and the solutions I have in mind would modify this (I will post them soon). I don't think it prevents the creation of functions that are the weighted sum of two or more functions.

@vguillemot I'm not sure I understand your point, so I will try to rephrase your comment to check. You mean that we can't have simple rules to compute the A matrix for L1 + TV, and that it is therefore easier to implement them in an object (so that all the tricky computation happens inside, in an ad-hoc way). Is that correct? In this case, as I said earlier in this message, I think we can still create such classes.

tomlof commented 10 years ago

If f = h + g and only h is strongly convex, while g is "only" convex, then f is still a strongly convex function.
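One way to see this, with $\mu$ the strong convexity constant of h: by definition $h(x) - \tfrac{\mu}{2}\|x\|^2$ is convex, so

$$
f(x) - \tfrac{\mu}{2}\|x\|^2 = g(x) + \left( h(x) - \tfrac{\mu}{2}\|x\|^2 \right)
$$

is a sum of two convex functions, hence convex, i.e. f is strongly convex with the same constant $\mu$.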

However, with the gradient and proximal operator, your problem description applies. Let f = g + h, with g smooth and h non-smooth with a proximal operator. Then f.grad = g.grad and f.prox = h.prox in the current implementation.
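Schematically (again an illustrative sketch, not the real parsimony code or its exact signatures), the combined function behaves like:

```python
# Illustrative sketch only -- not the real parsimony implementation.
class SmoothPlusProximal(object):
    """f = g + h, with g smooth and h non-smooth but with a proximal operator."""

    def __init__(self, g, h):
        self.g = g
        self.h = h

    def f(self, beta):
        return self.g.f(beta) + self.h.f(beta)

    def grad(self, beta):
        # Only the smooth part contributes to the gradient seen by the algorithm.
        return self.g.grad(beta)

    def prox(self, beta, factor=1.0):
        # Only the non-smooth part supplies the proximal operator.
        return self.h.prox(beta, factor)
```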

I am not sure this is a huge problem; it seems rather a problem of documenting the functions properly. The documentation should make it clear what is actually returned by each function call.

duboism commented 10 years ago

Here are my thoughts on the issue; maybe they will make the discussion clearer. It's a bit long, so take your time reading it.

Once again, the problem is that, in the code, the argument to the run function must have all the properties, while mathematically speaking only the smooth part (g) or the non-smooth part (h) has each of those properties. The interfaces for function properties (StronglyConvex, Nesterov) are nice, so we should keep them, but I think the CombinedFunction is a bit complex.

Another point to keep in mind is that, in the current situation, estimators can use any compatible algorithm. This is nice since we can construct the algorithm with the right options (for instance to retrieve information) and pass it to the estimator. However, I think that in practice estimators will only work with a limited number of algorithms.

I have several ideas to try to improve the situation:

  1. The first proposal would be to modify the run function of the algorithms to accept the right number of functions (and check their properties). The estimator would then be in charge of calling the algorithm properly. This is simple but would have to be done for every algorithm. As @tomlof mentioned, this clearly goes against polymorphism.
  2. The second proposal is to introduce an abstract class called Problem and some concrete subclasses. Algorithms would then minimize an instance of these classes. For instance, for problems with a smooth and a non-smooth part (without other properties) we could have a class SmoothNonSmoothProblem constructed with 2 functions (a rough sketch is given at the end of this comment). Of course, the constructor would have to check that the properties of the functions are correct (all the interfaces for the functions could still be used). I don't know if other classes are needed, but it could be the case (for instance CONESTA uses a smooth part, a Nesterov part and a part with a proximal operator). This could also simplify checking that an algorithm matches a problem.
  3. The third proposal would be to create classes of algorithms (each class being able to solve a particular kind of problem). I'm not sure it is always possible to do that with simple inheritance, and we should probably avoid adding another set of interfaces for algorithms.

It's not clear yet what the advantages/disadvantages of each solution are (and they are not mutually exclusive). I think solution 2 is close to the idea of CombinedFunction but states the combination explicitly (and the algorithms stay polymorphic). I would explore it first.
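Here is a very rough sketch of what proposal 2 could look like; every name below is hypothetical, nothing in it is existing parsimony API:

```python
# Hypothetical sketch of proposal 2 -- none of these names exist in parsimony.
class Problem(object):
    """Abstract base: a minimisation problem built from one or more functions."""


class SmoothNonSmoothProblem(Problem):
    """min_beta g(beta) + h(beta), with g smooth and h non-smooth but proximable."""

    def __init__(self, g, h):
        # The constructor is where the properties of each part are checked.
        # In practice this could use isinstance checks against the existing
        # function interfaces; plain duck typing keeps the sketch self-contained.
        if not hasattr(g, "grad"):
            raise ValueError("g must be smooth (provide a gradient).")
        if not hasattr(h, "prox"):
            raise ValueError("h must provide a proximal operator.")
        self.g = g
        self.h = h
```

An ISTA/FISTA-type algorithm would then keep a single signature, something like run(problem, start_vector), and call problem.g.grad and problem.h.prox directly, so the mathematical roles stay explicit while the algorithms remain polymorphic.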

duboism commented 10 years ago

@tomlof I agree with you, and I should have mentioned grad and prox earlier (I noticed the problem when working on the strong convexity, but we soon realized that it also applies to those). As I said, the code works. We just think we could do a bit better.

Édouard told me that in the first versions of parsimony, algorithms used different calls, and I understand the problems that can create. In my humble opinion, solution 2 could give us the best of both worlds.

vguillemot commented 10 years ago

What I'm saying, though maybe I completely misunderstood the problem, is that this solution reflects the fact that ISTA/FISTA and the Excessive Gap Algorithm (and, in the future, other algorithms) need slightly different "mathematical" tools in order to work, and that it is very tricky to try to factor them. What is smoothed in one case might not be in the other, and depending on what you smooth, you need to know about different parts of the objective function (sometimes all of them)...

Anyway, @tomlof knows better, so I guess I'll be quiet from now on about this issue.

duboism commented 10 years ago

@vguillemot I understand your point and I agree: factoring them is hard, and not factoring them also leads to problems.

If anybody has any advice on my proposals, I would be happy to discuss them, but it's not urgent.

tomlof commented 10 years ago

I have been thinking about this during the weekend, but I have not been able to come to a conclusion about what would be a good way to do it.

We could add a Problem class; that would make sense, since some properties of the Functions are currently not really properties of actual mathematical functions, but rather properties of some optimisation problem. E.g. CombinedFunction would really be a Problem.

However, while that would be a semantic improvement, I'm still not sure about how that would change anything practically.

Perhaps you could give a high-level example of how you would like to do it?