Scaling can be done automatically for fully bounded variables. Using a nonlinear transform (like arcsine or tanh) may not be a good idea for convergence. The main reason to scale variables is to have changes of similar magnitude for all variables of the problem and, for example, to avoid having the convergence depend on the units of the variables. This is a usual trick when applying a numerical method to a problem with heterogeneous variables.
Hi @jschueller and @emmt,
Apologies for the late reply.
First of all, I totally agree that scaling is important. Thank you for raising this point.
However, I do not think a trivial and good scaling strategy exists.
Similar to what @emmt pointed out, it is generally wrong to scale using a nonlinear transformation. I am not saying that it is always wrong: a certain transformation may work extremely well on a particular class of problems (see the example at the end of this comment), but a nonlinear transformation that does not fit the problem will likely make it more difficult.
It is hard to find a scaling strategy that fits all problems. For example, consider the following problems.
1.) No constraints. The starting point is $0$.
2.) No constraints. The starting point is $10^{-3}$.
3.) No constraints. The starting point is $10^{-16}$.
4.) No constraints. The starting point is $10^{10}$.
5.) The bound constraint is $-10^{10} \le x \le 10^{10}$. The starting point is $10^{10}$.
6.) The bound constraint is $10^{10}-1 \le x \le 10^{10}+1$. The starting point is $10^{10}$.
7.) The bound constraint is $-\infty < x \le 10^{10}+1$. The starting point is $10^{10}$.
The best scaling strategy is only known to the user, who should scale the problem before passing it to the solver. If he/she does not do so, I suggest defining the scaling factor with the following strategy or something similar. The factor is defined entrywise. Here, $t$ is a small value, e.g., $10^{-3}$.
1.) The bound constraint is $l \le x \le u$: use $(u-l)/2$, provided that this value is at least eps (otherwise, we consider that $x$ should be fixed at $(l+u)/2$ and remove this variable).
2.) The bound constraint is $-\infty < x \le u$: use $\max(|x_0|, |u|)/2$, provided that this value is at least $t$.
3.) The bound constraint is $l \le x < \infty$: use $\max(|x_0|, |l|)/2$, provided that this value is at least $t$.
4.) No bound constraint: use $|x_0|/2$, provided that this value is at least $t$.
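To make the strategy concrete, here is a minimal Julia sketch of one possible reading of it; the name `default_scaling` and the fallback to `t` for near-degenerate components are assumptions, not part of any existing interface.

```julia
# Hypothetical sketch of the entrywise scaling strategy above. `x0` is the
# starting point, `l` and `u` the bounds (use -Inf/Inf for absent bounds),
# and `t` the small threshold, e.g., 1e-3.
function default_scaling(x0, l, u; t = 1e-3)
    s = zeros(length(x0))
    for i in eachindex(x0)
        if isfinite(l[i]) && isfinite(u[i])        # 1.) l <= x <= u
            s[i] = (u[i] - l[i]) / 2               # below eps, x[i] would be
                                                   # fixed at (l[i] + u[i]) / 2
        elseif isfinite(u[i])                      # 2.) -inf < x <= u
            s[i] = max(abs(x0[i]), abs(u[i])) / 2
        elseif isfinite(l[i])                      # 3.) l <= x < inf
            s[i] = max(abs(x0[i]), abs(l[i])) / 2
        else                                       # 4.) no bound constraint
            s[i] = abs(x0[i]) / 2
        end
        s[i] = max(s[i], t)   # assumption: fall back to t for tiny values
    end
    return s
end
```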
There should be a flag to switch on or off the scaling. I suggest switching it off by default.
Note that the above strategy does not work well in the following case, where the problem probably needs a nonlinear transformation (taking a logarithm) before any scaling (recall point 1 above).
Thanks.
> 3. The best scaling strategy is only known to the user, who should scale the problem before passing it to the solver.
> 4. There should be a flag to switch on or off the scaling. I suggest switching it off by default.
I totally agree with these 2 points; this is why I would add the scaling as an optional argument (e.g., a keyword in the Julia interface) whose values (the scaling factors for each variable of the problem) are chosen by the caller.
If the optional scaling is specified as a single value, we could assume that it is the parameter `t` in the above discussion and compute the scaling factors following what you propose and according to the values of the initial variables and the bounds (`x0`, `l`, and `u`).
> If the optional scaling is specified as a single value, we could assume that it is the parameter `t` in the above discussion and compute the scaling factors following what you propose and according to the values of the initial variables and the bounds (`x0`, `l`, and `u`).
The scaling scheme involving `t` is only a very rough idea. I am not sure whether it is the best scaling we can have without information from the user. However, I do not think it is a good idea to let users input `t` in any way. It is a parameter that is too technical and should not be exposed to the user.
The MATLAB interface does the following:
1.) Provide an option `scale`, which is defaulted to false.
2.) If `scale` is true, then scale the problem in the best way that we can figure out according to the data (e.g., using `x0`, `lb`, and `ub` as elaborated above).
We do not accept a `scaling` that is input by the user: if the user does know the scaling, then he/she should have scaled the problem before sending it to the solvers. It is not something complicated or error-prone to do, and it is unnecessary for the solvers to make things complicated (e.g., `iprint`, as you noted) by handling such things.
In other words, we only handle two cases: either the user scales the problem before calling the solvers, or the solvers scale it automatically according to the data when `scale` is true.
Thanks.
> We do not accept a `scaling` that is input by the user: if the user does know the scaling, then he/she should have scaled the problem before sending it to the solvers. It is not something complicated or error-prone to do, and it is unnecessary for the solvers to make things complicated (e.g., `iprint`, as you noted) by handling such things.
OK, but then this means that, internally, Powell's algorithms are capable of dealing with non-unit scaling of variables (in particular for the `iprint` stuff), so why not let the user specify scaling factors if she/he has good reasons to do so and does not want to cope with the complicated task of correctly dealing with these factors in the objective function? An example that comes to my mind is model fitting with heterogeneous parameters: most users would like to express their objective function as simply as possible, considering their units of choice for the variables, but also have an idea of the typical magnitude of the variables at the solution. This information can easily be turned into scaling factors. This was my proposition.
OK. I agree.
What about the following?
1.) Define two options, `scale` and `scaling_factor`, the first one being logical/boolean, and the second being a positive vector of length `n` (the dimension of the problem).
2.) If the user inputs `scale = true` or does not input it, and specifies `scaling_factor`, then scale the problem as `x_before_scaling = scaling_factor * x_after_scaling + x0`, where `*` is entry-wise (see the sketch after this list).
3.) If the user inputs `scale = true` but does not specify `scaling_factor`, then we define the scaling factor in a certain way (e.g., the way outlined above) and do the scaling as in 2.
4.) If the user inputs `scale = false` but also inputs `scaling_factor`, then raise a warning and do not do the scaling.
5.) If the user inputs `scaling_factor`, we need to verify its validity unless `scale` is set to `false`. A scaling factor is valid if each component is positive and finite. If any component is invalid, then we raise a warning and define the scaling factor for this component in our own way. All the valid components of the user-defined scaling factor will be used as is.
The problem with `iprint` is not solvable. The best we can do is to mention in the documentation that the information printed by the Fortran backend corresponds to the scaled problem if the scaling does happen.
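To illustrate the change of variables in point 2, here is a minimal Julia sketch; the names (`unscale`, `rescale`, `f`, `g`) and the concrete values are purely illustrative and not part of any actual interface.

```julia
# Minimal sketch of the entry-wise affine change of variables in point 2;
# the names below are illustrative, not part of any actual interface.
scaling_factor = [1e3, 1e-2]             # one positive factor per variable
x0 = [5.0, 0.1]                          # starting point, original variables
unscale(y) = scaling_factor .* y .+ x0   # x_before_scaling from x_after_scaling
rescale(x) = (x .- x0) ./ scaling_factor # inverse map
f(x) = sum(abs2, x .- [1.0, 2.0])        # objective in the original variables
g(y) = f(unscale(y))                     # objective actually seen by the solver
```

Note that, with this transform, the solver starts from the origin, since `rescale(x0)` is the zero vector.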
Since, in Julia, we have access to the type of the arguments/keywords, we can combine your suggested logic in a single keyword, say `scale`, so that when `scale = true` or, better, `scale = :auto` (where `:auto` is the symbolic name `auto` in Julia), some automatic scaling strategy is applied (along your point 2); otherwise `scale` may be set to a vector of scaling factors, or left at its default value `false` or `:none` (again a Julia symbolic name). Using symbolic names rather than `true` or `false` is more informative about what will be done regarding the scaling of the variables and leaves open the possibility of implementing different strategies identified by their names. Besides, having the scaling of the variables controlled by a single keyword is easier for users to understand.
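A rough sketch of how such a single `scale` keyword could be dispatched on in Julia follows; `resolve_scaling` is a hypothetical name, and `default_scaling` stands for whatever automatic strategy is adopted (e.g., the sketch earlier in this thread). None of these names are taken from PRIMA.jl.

```julia
# Hypothetical sketch of handling a single `scale` keyword by inspecting
# its value/type; `default_scaling` is a stand-in for the automatic strategy.
function resolve_scaling(scale, x0, l, u)
    if scale === :none || scale === false
        return nothing                     # no scaling of the variables
    elseif scale === :auto || scale === true
        return default_scaling(x0, l, u)   # some automatic strategy
    elseif scale isa AbstractVector{<:Real}
        return scale                       # user-supplied scaling factors
    else
        throw(ArgumentError("unsupported value for keyword `scale`"))
    end
end
```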
> If the user inputs `scaling_factor`, we need to verify its validity unless `scale` is set to `false`. A scaling factor is valid if each component is positive and finite. If any component is invalid, then we raise a warning and define the scaling factor for this component in our own way. All the valid components of the user-defined scaling factor will be used as is.
I agree that scaling factors must be checked for validity but I would rather throw an exception (i.e. raise an error rather than a warning) if they are invalid because it means that the caller has made a wrong choice and must fix that.
> Since, in Julia, we have access to the type of the arguments/keywords, we can combine your suggested logic in a single keyword, say `scale`, so that when `scale = true` or, better, `scale = :auto` (where `:auto` is the symbolic name `auto` in Julia), some automatic scaling strategy is applied (along your point 2); otherwise `scale` may be set to a vector of scaling factors, or left at its default value `false` or `:none` (again a Julia symbolic name). Using symbolic names rather than `true` or `false` is more informative about what will be done regarding the scaling of the variables and leaves open the possibility of implementing different strategies identified by their names. Besides, having the scaling of the variables controlled by a single keyword is easier for users to understand.
This is an excellent idea!
I agree that `scale = :auto` conveys the idea better than `scale = true`.
> > If the user inputs `scaling_factor`, we need to verify its validity unless `scale` is set to `false`. A scaling factor is valid if each component is positive and finite. If any component is invalid, then we raise a warning and define the scaling factor for this component in our own way. All the valid components of the user-defined scaling factor will be used as is.
>
> I agree that scaling factors must be checked for validity but I would rather throw an exception (i.e., raise an error rather than a warning) if they are invalid because it means that the caller has made a wrong choice and must fix that.
I tend to disagree with this point. The reason is as follows.
When developing a solver (be it for optimization, linear algebra, PDEs, ...), we should keep an important fact in mind: the "caller" is not necessarily a human. It is very likely another piece of code that calls the solver. The invocation of the solver is probably quite a small part of a bigger project. Moreover, the invocation of the solver may be done repeatedly in a loop.
For example, consider the following pseudo-code.
```
do k = 1, 2, ..., until x_k achieves some kind of convergence
    x_k     = X(x_1, ..., x_{k-1}, other data)  ! compute the next starting point
    obj_k   = O(x_1, ..., x_k, other data)      ! build the objective function
    scale_k = S(x_1, ..., x_k, other data)      ! compute the scaling factors
    solver(obj_k, x_k, scale_k)                 ! invoke the solver
end do
```
Suppose that, in theory, it is guaranteed that `scale_k` is always valid. In computation, `scale_k` may become invalid (some components become zero or small negative values) due to ill-conditioning and rounding errors, even though the "some kind of convergence" is still far from being achieved. In this situation, should we raise an error and terminate the computation, or should we raise a warning, correct the `scale_k`, and continue with the loop?
In my opinion, a user-friendly package should tolerate any erroneous options as long as the problem being solved is still well-defined. If an option is invalid, then raise a warning and reset it to a correct one. (Of course, exceptions exist.)
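As a sketch of this warn-and-reset policy applied to scaling factors (all names hypothetical, with `fallback` standing for whatever the automatic strategy would produce):

```julia
# Hypothetical sketch of the warn-and-reset policy for scaling factors.
function sanitize_scaling!(s, fallback)
    for i in eachindex(s)
        if !(isfinite(s[i]) && s[i] > 0)   # invalid: nonpositive or not finite
            @warn "scaling factor $(s[i]) at index $i is invalid; resetting it"
            s[i] = fallback[i]             # reset this component, keep the rest
        end
    end
    return s
end
```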
Indeed, this happens quite often for linear solvers. Consider a problem where the coefficient matrix is theoretically nonsingular. In practice, the matrix may become numerically singular due to ill-conditioning. If this happens, should we abort the computation and raise an error, or should we raise a warning and solve the problem in the best way we can? I would prefer the latter. (In this case, the problem being solved is not even well-defined in the classical sense.)
What do you think? Thanks.
Well, for me, such an error indicates an erroneous use of the program, which I would not let pass through. My experience is that it's usually a bad idea to sweep this kind of mistake under the carpet, because it delays the moment when one realizes the error and correctly fixes it.
In the case of your example, a possible explicit fix of `scale_k` is to replace the call to `solver` by:

```
solver(obj_k, x_k, max(scale_k, eps))
```

assuming `max` operates elementwise and where `eps` is any suitable positive threshold. Many other strategies are possible, for instance fixing the problem only if the issue is raised (you can generally catch errors in high-level languages).
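For instance, the catch-and-fix variant could look like the following in Julia; this is only a sketch, with `solver` and `ArgumentError` as placeholders for whatever the real interface uses, and the `*_k` variables taken from the pseudo-code above.

```julia
# Sketch of repairing the scaling only when the solver complains; `solver`
# and ArgumentError are placeholders for the real interface.
try
    solver(obj_k, x_k, scale_k)
catch err
    err isa ArgumentError || rethrow()        # only handle invalid scaling
    solver(obj_k, x_k, max.(scale_k, eps()))  # retry with repaired factors
end
```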
We don't have to make a decision right now, especially not until we've defined a rescue strategy. I would propose to cope with this in several steps:
At least implementing step 1 is a prerequisite.
> such an error indicates an erroneous use of the program, which I would not let pass through.
I agree that some errors likely indicate an erroneous use of the program, which should not be let pass through. This is what I mean by the exceptions when saying the following.
> In my opinion, a user-friendly package should tolerate any erroneous options as long as the problem being solved is still well-defined. If an option is invalid, then raise a warning and reset it to a correct one. (Of course, exceptions exist.)
However, I am not sure whether an invalid `scale` falls into this category or not. To me, if `scale` is defined by some code using some data, an invalid `scale` sounds similar to a numerically singular matrix in the case of linear systems.
> My experience is that it's usually a bad idea to sweep this kind of mistake under the carpet, because it delays the moment when one realizes the error and correctly fixes it.
I agree with this idea in general. That is why users should always be informed when some inputs are erroneous, either by a warning (in most cases) or by an error (in exceptional cases).
> We don't have to make a decision right now, especially not until we've defined a rescue strategy. I would propose to cope with this in several steps:
I agree. Anyway, the particular question of "how to deal with an invalid `scale`" is not a big issue (but the question of "how to deal with invalid inputs in general" is important).
Thanks.
It turns out I'm not so much interested in that anymore.
The MATLAB interface scales the problem only if all variables are bounded on each side. However, it is still possible to transform unbounded variables with arcsine (or tanh), for example: https://lmfit.github.io/lmfit-py/bounds.html. What do you think?
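For reference, the transform described on that page maps an unconstrained internal variable into $[l, u]$ with a sine; a minimal Julia sketch (function names illustrative):

```julia
# Minimal sketch of the sine/arcsine bound transform described on the
# lmfit page linked above; the function names are illustrative.
to_bounded(y, l, u)  = l + (u - l) * (sin(y) + 1) / 2   # R -> [l, u]
to_internal(x, l, u) = asin(2 * (x - l) / (u - l) - 1)  # [l, u] -> [-pi/2, pi/2]
```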
In the Julia interface, we were wondering if this should be done in the Fortran or C layer: https://github.com/emmt/PRIMA.jl/issues/4#issuecomment-1740971230. A concern by @emmt was that `iprint` would print scaled variables.