jbrea / CMAEvolutionStrategy.jl

StackOverflow error on v0.2.5 but not v0.2.4 #7

Closed Denis-Titov closed 1 year ago

Denis-Titov commented 1 year ago

Hi,

Apologies in advance but I couldn't find a MWE.

I used version 0.2.3 for some time and everything was great, but after upgrading to 0.2.5 I started getting a StackOverflow error. The whole error message is only "ERROR: StackOverflowError:". I tried to reproduce it with rosenbrock, but that works fine.

My optimization is a least-squares fit of a kinetic rate equation with 26 kinetic constants to ~500 data points, which I repeat 20 times to make sure I'm close to the global minimum. I looked through the commits, and it seems there was only one change outside the tests, which was about stability.

I'm happy to try a few fixes if you have ideas, but since I don't have an MWE, I understand this might be difficult. I'm also happy to send you my code, but it's about 300 lines, so it might be a lot of work to look through. I'll close this if you feel it'll be hard to fix without an MWE. I'm using 0.2.4 for now and everything works well.

jbrea commented 1 year ago

Thanks for reporting!

I suspect the problem is the new version of MEigen. Does the problem persist with the newest commit? You can load the newest version like this:

```julia
using Pkg
Pkg.add(name = "CMAEvolutionStrategy", rev = "376fa68")
```

If you prefer, you can also send me your code and I can try it myself.

Denis-Titov commented 1 year ago

The StackOverflow error is gone with the new commit, but unfortunately I cannot reproduce it on v0.2.5 anymore either 🤦‍♂️ The error was very robust before, appearing every single time I ran that optimization, and as far as I can tell I used exactly the same code. Sorry for wasting your time... Maybe something else was going on on my computer that in some weird way caused this error before, or maybe some other package got updated, causing the error to go away.

Denis-Titov commented 1 year ago

UPDATE:

I could reproduce the StackOverflow error on a cluster with a larger number of optimization runs. One of my runs has about 100x10x20=20,000 optimizations.

With v0.2.5: 2 out of 4 runs had a StackOverflow error
With 376fa68: 1 out of 4 runs had a StackOverflow error
With v0.2.4: I've never seen a StackOverflow error in 100+ runs

Not clear if the difference between v0.2.5 and 376fa68 is significant due to how rare this error is. But definitely, v0.2.4 doesn't exhibit the error. Sorry, I can't be more helpful here.
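As a rough check of the significance question, here is a minimal sketch of a one-sided Fisher exact test on the counts reported above. The helper names `hyper` and `fisher_onesided` are made up for this illustration, and "100+ runs" is assumed to mean exactly 100 runs with zero failures.

```julia
# Hypergeometric probability of exactly k failures landing in group A,
# given `fails` failures in total among `n` runs, `nA` of which are in group A.
hyper(k, fails, nA, n) =
    binomial(fails, k) * binomial(n - fails, nA - k) / binomial(n, nA)

# One-sided Fisher exact p-value: probability that group A shows at least
# as many failures as observed if both groups share the same failure rate.
function fisher_onesided(failsA, nA, failsB, nB)
    fails, n = failsA + failsB, nA + nB
    kmax = min(fails, nA)
    return sum(hyper(k, fails, nA, n) for k in failsA:kmax)
end

# v0.2.5 (2 of 4 failed) vs. 376fa68 (1 of 4 failed)
p_025_vs_376 = fisher_onesided(2, 4, 1, 4)

# v0.2.5 (2 of 4 failed) vs. v0.2.4 (assumed 0 of 100 failed)
p_025_vs_024 = fisher_onesided(2, 4, 0, 100)

println("v0.2.5 vs 376fa68: p = ", p_025_vs_376)
println("v0.2.5 vs v0.2.4:  p = ", p_025_vs_024)
```

Under these assumptions the first comparison gives p = 0.5, so four runs each cannot separate v0.2.5 from 376fa68, while the second comparison gives a p-value of roughly 0.001, consistent with v0.2.4 being genuinely different.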

If you want to try other fixes, I can run them on my code but due to how rare the error is, I'm not sure if it's worth it as I'll have to run for a long time to be confident.

jbrea commented 1 year ago

Thanks a lot for the update! This is very helpful. Do you still get StackOverflow errors with 19eae17?

Denis-Titov commented 1 year ago

19eae17 seems to have done something. I reran the same analysis 10 times (~100,000 optimizations each) and did not get any StackOverflow errors. I'll let you know if I encounter this error again in the future, but it seems to have been fixed (or at least improved) by 19eae17.

Out of curiosity, what was the rationale for commit https://github.com/jbrea/CMAEvolutionStrategy.jl/commit/c26ee4fd04fe98ced8f0f91c485b4fb103656af7, which presumably led to this rare error? What "stability" did it improve?

jbrea commented 1 year ago

Great, thanks for the feedback!

> Out of curiosity, what was the rationale for commit c26ee4f, which presumably led to this rare error? What "stability" did it improve?

CMA-ES assumes a positive definite covariance matrix, but in rare cases this isn't satisfied. To prevent failures caused by a covariance matrix that isn't positive definite, I used an unjustified heuristic prior to c26ee4f, but I noticed that this had a small negative effect on some results. Starting with c26ee4f, the covariance matrix is changed only when it isn't positive definite: the identity matrix multiplied by some small constant is added repeatedly until the covariance matrix is positive definite (this is done recursively, which can result in a StackOverflow error). With 19eae17, the small constant is multiplied by 10 in each recursion, which should be sufficient in all reasonable cases to prevent stack overflow.
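The recursion described above can be sketched roughly as follows. This is an illustration only, not the package's actual code: the function name `regularize`, the starting constant `1e-10`, and the `maxdepth` guard are all assumptions made for the example.

```julia
using LinearAlgebra

# Hypothetical helper (not the package's implementation): add shift * I to a
# symmetric matrix C until it is positive definite. The shift grows by the
# factor `growth` on every recursion; growth = 1 would correspond to adding
# the same small constant each time.
function regularize(C::AbstractMatrix, shift = 1e-10;
                    growth = 10, depth = 0, maxdepth = 100)
    isposdef(C) && return (C, depth)
    depth >= maxdepth && error("matrix still not positive definite after $maxdepth steps")
    return regularize(C + shift * I, growth * shift;
                      growth = growth, depth = depth + 1, maxdepth = maxdepth)
end

# Symmetric test matrix with eigenvalues 3 and -1, so it is not positive definite.
C = [1.0 2.0; 2.0 1.0]
C_fixed, steps = regularize(C)

println("positive definite after ", steps, " recursions: ", isposdef(C_fixed))
```

For this test matrix the total shift has to exceed 1. With the shift growing tenfold per step, that takes 11 recursions; with a constant shift of `1e-10` it would take on the order of 10^10 recursive calls, which is the kind of depth that ends in a StackOverflowError.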