jump-dev / ECOS.jl

A Julia interface to the ECOS conic optimization solver
https://github.com/embotech/ecos

Subtle solution difference upgrading from Julia v1.6.1 --> v1.7.1 causes my iterative solver to fail #133

Closed dmalyuta closed 2 years ago

dmalyuta commented 2 years ago

Hello team, thanks for developing ECOS.jl. I'm writing a new package for sequential convex programming called the SCP Toolbox. I have run into a very subtle issue in ECOS related to a Julia version upgrade from v1.6.1 to v1.7.1. Even though none of the installed package versions change, the behavior of ECOS changes very slightly. Because my package is iterative, a "numerically not-so-stable" unit test in my package fails simply due to the version upgrade.

First things first, I am on Ubuntu:

$ uname -r
5.11.0-40-generic
$ uname -a
Linux danylo-XPS-13-9360 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The Julia version where things work is v1.6.1, and the one where things break is v1.7.1. The unit test in question is:

https://github.com/dmalyuta/scp_traj_opt/blob/bugfix/ecos-numerical-error/test/runtests.jl#L78

In particular, the test that fails occurs here:

https://github.com/dmalyuta/scp_traj_opt/blob/bugfix/ecos-numerical-error/test/examples/rendezvous_3d/tests.jl#L215

You can run the code yourself by downloading the repository and running ] test in the Julia REPL under v1.6.1 and v1.7.1. I have also attached the stdout from testing both versions. You can see that the iterations track each other very closely up until iteration 13 (of my SCP algorithm, that is, not of ECOS's interior-point method). At that point, ECOS under Julia v1.6.1 stops short with "Close to OPTIMAL" status, whereas under v1.7.1 it actually finds the OPTIMAL solution. This divergence unfortunately causes the v1.6.1 run to achieve OPTIMAL on iteration 14, while the v1.7.1 run stops short with "NUMERICAL PROBLEMS" on iteration 14.

I think that this is an interesting bug because the package versions remain the same for both runs, only the underlying Julia language is "newer". In optimization we obviously never want to see a situation where an upgraded environment suddenly changes convergence behavior.

If you need to know something else about this issue, please let me know.

stdout_julia_v161.txt stdout_julia_v171.txt

mlubin commented 2 years ago

> In optimization we obviously never want to see a situation where an upgraded environment suddenly changes convergence behavior.

Floating-point computations can depend on a variety of environmental factors like the compiler versions, math libraries, BLAS libraries, etc. I don't find the change in convergence behavior particularly surprising. It would be a good exercise to trace through the code in ECOS to see what causes the divergence, but my guess is that we won't find a bug here.
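As a minimal illustration of the point above (this is not taken from ECOS): floating-point addition is not associative, so a compiler or math library that merely regroups a sum can change the last bits of the result. In an iterative method, such a difference can then be amplified from one iteration to the next.

```julia
# Floating-point addition is not associative: regrouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

println(a == b)          # false
println(abs(a - b))      # a tiny but nonzero difference
println(isapprox(a, b))  # true: equal up to the default relative tolerance
```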

odow commented 2 years ago

Julia's BLAS changed between 1.6 and 1.7, but ECOS_jll has no external dependencies, so I'm not sure that's the problem.

There were also changes to the random number generation. Did you check that your Julia code is deterministic under Julia 1.6 and 1.7? The most likely culprit is that you aren't passing bit-for-bit identical models to ECOS between Julia 1.6 and 1.7.
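One way to check this, sketched below under some assumptions: hash the raw bits of the problem data at each SCP iteration, print the digest, and compare the output of the two Julia versions. The helper name `fingerprint` and the arrays `c` and `c_perturbed` are hypothetical stand-ins, not part of ECOS.jl or JuMP. The sketch uses the SHA standard library rather than `hash`, because a SHA-256 digest depends only on the bytes.

```julia
using SHA  # standard library

# Hypothetical helper: hash the raw bits of a dense Float64 array so that two
# runs can be compared exactly. Not part of ECOS.jl or JuMP.
fingerprint(x::Array{Float64}) =
    bytes2hex(sha256(collect(reinterpret(UInt8, vec(x)))))

# Stand-ins for the data passed to the solver at one SCP iteration.
c = [1.0, 2.0, 3.0]
c_perturbed = [1.0, 2.0, nextfloat(3.0)]  # differs in the last bit only

println(fingerprint(c))
println(fingerprint(c) == fingerprint(copy(c)))      # true
println(fingerprint(c) == fingerprint(c_perturbed))  # false
```

The first iteration at which the digests differ between the two versions is where the inputs stop being identical, which narrows down whether the divergence starts in the wrapping code or inside the solver.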

dmalyuta commented 2 years ago

@odow that's a good callout, maybe the inputs to ECOS are not exactly the same if my code produces slightly different outputs due to the BLAS change. Even if ECOS doesn't depend on it, my external code that wraps ECOS probably does. Where in Julia is BLAS used? Is there a list, or some other way to know, which functions call it?

odow commented 2 years ago

BLAS will probably be used if you call any linear-algebra functions. There's no easy way to isolate whether and where it is called.
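As a rough sketch, you can at least ask each Julia session which BLAS library it has loaded and compare the two versions. The operations listed at the end are examples that typically dispatch to BLAS for dense Float64 arrays; this is an illustrative list, not an exhaustive one.

```julia
using LinearAlgebra

# Report which BLAS library this Julia session has loaded.
# BLAS.get_config() was added in Julia 1.7; BLAS.vendor() is the older query.
blas_info = @static if VERSION >= v"1.7"
    BLAS.get_config()
else
    BLAS.vendor()
end
println(blas_info)

# Examples of operations that typically dispatch to BLAS for dense Float64 data:
A = rand(4, 4)
x = rand(4)
y = A * x      # matrix-vector product
C = A * A'     # matrix-matrix product
d = dot(x, x)  # inner product
```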

I think you should focus on the underlying issue: your code should be robust to these differences. You should not expect to have identical performance when changing versions or machines.
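A minimal sketch of what a more tolerant unit test could look like, assuming the test currently checks an exact value or a single solver status. The numbers and status symbols below are hypothetical stand-ins; in JuMP the corresponding termination statuses would be `MOI.OPTIMAL` and `MOI.ALMOST_OPTIMAL`.

```julia
using Test

# Hypothetical reference and computed values standing in for a converged cost.
reference_cost = 1.2345678
computed_cost  = 1.2345678 + 3e-9  # small platform-dependent deviation

# Brittle: exact equality fails on any last-bit difference.
# @test computed_cost == reference_cost

# More robust: compare up to explicit tolerances.
@test isapprox(computed_cost, reference_cost; rtol = 1e-6, atol = 1e-8)

# Hypothetical status check: accept either a fully optimal or a
# reduced-accuracy ("close to optimal") solve instead of one exact status.
acceptable = (:OPTIMAL, :ALMOST_OPTIMAL)
status = :ALMOST_OPTIMAL
@test status in acceptable
```

The tolerances here are arbitrary; they should be chosen from the accuracy the solver is actually configured to deliver.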

odow commented 2 years ago

Closing because this doesn't seem like an issue with ECOS and there isn't anything actionable to do here. If you can come up with a reproducible example demonstrating an issue in ECOS, please re-open.