johannesulf / nautilus

Neural Network-Boosted Importance Nested Sampling for Bayesian Statistics
https://nautilus-sampler.readthedocs.io
MIT License

numpy.linalg.LinAlgError: Matrix is not positive definite #34

Closed: Jammy2211 closed this issue 1 year ago

Jammy2211 commented 1 year ago

I just had a run crash with the following error:

  File "/cosma/home/dp004/dc-nigh1/cti_require/cosma/runners/uniform/serial_x3.py", line 229, in <module>
    result_1 = search_1.fit(model=model_1, analysis=analysis)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 523, in fit
    result = self.start_resume_fit(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 630, in start_resume_fit
    self._fit(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 192, in _fit
    sampler.run(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 417, in run
    self.add_bound(verbose=verbose)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 857, in add_bound
    bound = NautilusBound.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/neural.py", line 254, in compute
    multi_ellipsoid = Union.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/union.py", line 136, in compute
    bound.bounds = [bound_class.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/basic.py", line 304, in compute
    bound.B = np.linalg.cholesky(np.linalg.inv(bound.A))
  File "<__array_function__ internals>", line 180, in cholesky
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 770, in cholesky
    r = gufunc(a, signature=signature, extobj=extobj)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 92, in _raise_linalgerror_nonposdef
    raise LinAlgError("Matrix is not positive definite")
numpy.linalg.LinAlgError: Matrix is not positive definite

This is the last bit of verbose output:

Adding Bound 105:    done
Ellipsoids:             1
Neural Networks:        4
Filling Bound 105:   done
N_like:             25700
N_eff:               1164
log Z:      -22583230.923
log V:            -74.100
f_live:             0.130

Adding Bound 106:    done
Ellipsoids:             1
Neural Networks:        4
Filling Bound 106:   done
N_like:             25900
N_eff:               1390
log Z:      -22583230.850
log V:            -74.777
f_live:             0.077

Adding Bound 107:    done
Ellipsoids:             1
Neural Networks:        4
Filling Bound 107:   done
N_like:             26100
N_eff:               1584
log Z:      -22583230.744
log V:            -74.948
f_live:             0.065

Adding Bound 108:    done
Ellipsoids:             1
Neural Networks:        4
Filling Bound 108:   done
N_like:             26300
N_eff:               1762
log Z:      -22583230.674
log V:            -75.338
f_live:             0.047

Adding Bound 110: skipped
Filling Bound 109:   done
N_like:             26900
N_eff:               2072
log Z:      -22583230.590
log V:            -75.414
f_live:             0.022

Adding Bound 110:    done
Ellipsoids:             1
Neural Networks:        4
Filling Bound 110:   done
N_like:             27100
N_eff:               2244
log Z:      -22583230.528
log V:            -76.140
f_live:             0.035

Adding Bound 111: skipped
Filling Bound 110:   done
N_like:             27400
N_eff:               2308
log Z:      -22583230.527
log V:            -76.140
f_live:             0.021

Adding Bound 111: skipped
Filling Bound 110:   done
N_like:             28100
N_eff:               2342
log Z:      -22583230.527
log V:            -76.140
f_live:             0.011

The analysis is probably quite hard to reproduce locally, as it uses an annoying-to-install C++ library wrapped in a Python code base.

I suspect I can fix this by altering settings (e.g. more live points or different bound settings), but I figured you'd want to be aware of a potentially run-breaking error, in case a try / except fix is possible.

johannesulf commented 1 year ago

Thanks so much for reporting this! Do you have a checkpoint file you could send me? I suspect the points are distributed such that a minimum-volume enclosing ellipsoid has 0 volume to within machine precision.
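For illustration, here is a generic numpy sketch (not nautilus's actual code path) of that failure mode: a point cloud that is degenerate along one direction yields a covariance matrix with a (near-)zero eigenvalue, and the `np.linalg.cholesky(np.linalg.inv(...))` step from the traceback then fails:

```python
import numpy as np

# A point cloud that is degenerate along one direction has a covariance
# matrix with a (near-)zero eigenvalue, i.e. an enclosing ellipsoid with
# (near-)zero volume.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
points[:, 2] = points[:, 0]  # third coordinate duplicates the first

cov = np.cov(points, rowvar=False)
print(np.linalg.eigvalsh(cov))  # smallest eigenvalue is ~0

# Inverting such a matrix amplifies rounding error, so the step from the
# traceback, np.linalg.cholesky(np.linalg.inv(A)), raises LinAlgError.
try:
    np.linalg.cholesky(np.linalg.inv(cov))
except np.linalg.LinAlgError as err:
    print(err)  # "Singular matrix" or "Matrix is not positive definite"
```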

Jammy2211 commented 1 year ago

A link to a checkpoint file is here:

https://drive.google.com/file/d/16LU44iwQ_dckJjfn_5_NycWk6MgPV_GE/view?usp=sharing

In case it's important, the fit applies the following assertions:

# Require the release timescales to increase along the trap list.
parallel_trap_0.add_assertion(
    parallel_trap_0.release_timescale < parallel_trap_1.release_timescale
)
parallel_trap_1.add_assertion(
    parallel_trap_1.release_timescale < parallel_trap_2.release_timescale
)

Basically, a couple of parameters are not allowed to be above other parameters, and if they are, I think that model is resampled or rejected somehow.

So, I would guess this could be creating an undesirable "edge" in parameter space.
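For context, a common way such assertions are enforced (a hypothetical sketch; the actual PyAutoFit mechanism may differ) is to return a log likelihood of -inf for any sample that violates the ordering, which produces exactly this kind of hard edge:

```python
import numpy as np

def log_likelihood(params):
    """Toy likelihood with an ordering constraint on release timescales."""
    t0, t1, t2 = params
    # Reject any model violating t0 < t1 < t2, as the assertions above do.
    # This carves a hard, zero-probability "edge" into parameter space.
    if not (t0 < t1 < t2):
        return -np.inf
    # Placeholder for the actual model evaluation.
    return -0.5 * float(np.sum((np.asarray(params) - 1.0) ** 2))
```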

johannesulf commented 1 year ago

Thanks! Unfortunately, the sampler in the checkpoint file is at bound 91, not 111. Do you have the checkpoint file right before it crashed?

Jammy2211 commented 1 year ago

If I resume the run, I get this verbose output:

====
Starting job 6420109 at Sun Aug 20 19:33:35 BST 2023 for user dc-nigh1.
Running on nodes: m8036
====
2023-08-20 19:34:07,584 - autofit.non_linear.search.abstract_search - INFO - Creating non-linear search
2023-08-20 19:34:07,624 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,654 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,684 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,715 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,745 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,775 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,806 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,836 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,837 - search[1]_parallel[x3]_nautilus - INFO - The output path of this fit is /cosma8/data/dp195/dc-nigh1/autocti/output/uniform/parallel_x3/quad1/search[1]_parallel[x3]_nautilus/930b2672ce93a1fe70da2082535f2baf
2023-08-20 19:34:07,837 - search[1]_parallel[x3]_nautilus - INFO - Outputting pre-fit files (e.g. model.info, visualization).
2023-08-20 19:36:06,444 - search[1]_parallel[x3]_nautilus - INFO - Resuming Nautilus non-linear search (previous samples found).
2023-08-20 19:36:11,272 - search[1]_parallel[x3]_nautilus - INFO - Fit Still Running: Updating results after 5000 iterations (see output folder for latest visualization, samples, etc.)
#########################
### Exploration Phase ###
#########################

Adding Bound 91

And this error:

Traceback (most recent call last):
  File "/cosma/home/dp004/dc-nigh1/cti_require/cosma/runners/uniform/parallel_x3.py", line 221, in <module>
    result_1 = search_1.fit(model=model_1, analysis=analysis)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 523, in fit
    result = self.start_resume_fit(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 630, in start_resume_fit
    self._fit(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 192, in _fit
    sampler.run(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 417, in run
    self.add_bound(verbose=verbose)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 857, in add_bound
    bound = NautilusBound.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/neural.py", line 254, in compute
    multi_ellipsoid = Union.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/union.py", line 136, in compute
    bound.bounds = [bound_class.compute(
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/basic.py", line 304, in compute
    bound.B = np.linalg.cholesky(np.linalg.inv(bound.A))
  File "<__array_function__ internals>", line 180, in cholesky
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 770, in cholesky
    r = gufunc(a, signature=signature, extobj=extobj)
  File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 92, in _raise_linalgerror_nonposdef
    raise LinAlgError("Matrix is not positive definite")
numpy.linalg.LinAlgError: Matrix is not positive definite

I had a few runs going, so the 121 bound may have been a different run, but this one at bound 91 is definitely at the point where the error comes up for one run!

Jammy2211 commented 1 year ago

Sorry, that should have been bound 111, not 121.

johannesulf commented 1 year ago

Interesting! Resuming from the file you sent me doesn't cause a crash for me. But let me look closer into this.

Jammy2211 commented 1 year ago

In case it's useful, here is the checkpoint file which crashed at bound 111:

https://drive.google.com/file/d/1eWq0iCnTh6nuCjZkiPh5H9cgX55-RVf8/view?usp=sharing

johannesulf commented 1 year ago

The issue has to do with numerical stability. The current implementation of finding a minimum-volume enclosing ellipsoid is a bit naive with respect to rounding errors. Can you try the stability branch and check whether this solves your issues? It should be more stable.
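A common mitigation for this class of failure (a generic sketch, not necessarily what the stability branch actually does) is to symmetrize the matrix and retry the factorization with a small, growing diagonal jitter:

```python
import numpy as np

def cholesky_with_jitter(matrix, max_tries=10):
    """Cholesky factorization that retries with growing diagonal jitter.

    Generic numerical-stability sketch; not nautilus's actual fix.
    """
    # Symmetrize first, since rounding can break exact symmetry.
    matrix = 0.5 * (matrix + matrix.T)
    jitter = 1e-12 * np.trace(matrix) / len(matrix)
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(matrix)
        except np.linalg.LinAlgError:
            # Nudge the eigenvalues upward and try again.
            matrix = matrix + jitter * np.eye(len(matrix))
            jitter *= 10.0
    raise np.linalg.LinAlgError("Matrix is not positive definite")
```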

johannesulf commented 1 year ago

@Jammy2211 I merged the branch into main. Hopefully, this will solve the issue you're encountering. But I'll leave this open until you confirm.

Jammy2211 commented 1 year ago

It fixed the error and the analysis has converged; however, it converged on a solution that is incorrect (much lower likelihood than a dynesty run).

I am doing some runs with more live points (they take > 12 hours, so no results yet) and will let you know.

Is it conceivable that insufficient numerical stability would lead one to infer a local maximum?

johannesulf commented 1 year ago

It's not impossible that the previous numerical instability may have caused this. Is the maximum likelihood nautilus infers at a very different location than what dynesty found? In principle, though, I would definitely recommend running with significantly more than 100 live points.

Jammy2211 commented 1 year ago

It's not impossible that the previous numerical instability may have caused this. Is the maximum likelihood nautilus infers at a very different location than what dynesty found?

Nautilus solution:

Bayesian Evidence                                                               -22466472.97699836
Maximum Log Likelihood                                                          -22466399.44634518

Summary (3.0 sigma limits):

cti
    serial_trap_list
        0
            density                                                             0.22 (0.21, 0.22)
            release_timescale                                                   2.07 (2.04, 2.09)
        1
            density                                                             4.18 (4.12, 4.24)
            release_timescale                                                   16.41 (16.29, 16.53)
        2
            density                                                             2.81 (2.74, 2.88)
            release_timescale                                                   32.87 (32.45, 33.25)
    serial_ccd
        well_fill_power                                                         0.57 (0.57, 0.57)
        well_notch_depth                                                        -44.38 (-47.78, -41.19)

Dynesty solution:

Bayesian Evidence                                                               -22409030.73498726
Maximum Log Likelihood                                                          -22408926.04040357

Summary (3.0 sigma limits):

cti
    serial_trap_list
        0
            density                                                             0.07 (0.07, 0.08)
            release_timescale                                                   0.81 (0.79, 0.83)
        1
            density                                                             0.22 (0.21, 0.23)
            release_timescale                                                   4.07 (3.91, 4.23)
        2
            density                                                             6.55 (6.54, 6.55)
            release_timescale                                                   20.02 (19.96, 20.07)
    serial_ccd
        well_fill_power                                                         0.58 (0.58, 0.58)
        well_notch_depth                                                        0.14 (-0.10, 0.44)

So, quite different.

In principle, though, I would definitely recommend running with significantly more than 100 live points.

We are very much run-time limited, so my strategy has been to figure out what works with the fewest live points, as this gives the fastest run times. I've probably just overshot it on this occasion, but we will see!

I am rerunning with 200 live points, and will go above that in subsequent tests if the solutions continue to be incorrect.

The settings I'm using work beautifully in my strong lens analysis for models with this many parameters (that is a different problem), so to some extent this seems specific to the problem at hand.

johannesulf commented 1 year ago

Yes, it looks like nautilus got stuck in a local maximum here. That's always possible, as for any global optimizer. I don't think it has to do with the original numerical stability issue. More live points would help. At the same time, there's also randomness to it: it may work with the same number of live points but a different random seed.
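For anyone landing here, both knobs are set when constructing the sampler. A minimal sketch with a toy likelihood, assuming the `n_live` and `seed` keyword arguments of `nautilus.Sampler`:

```python
import numpy as np
from nautilus import Prior, Sampler

# Toy 3-parameter problem; replace with the real prior and likelihood.
prior = Prior()
for key in ["a", "b", "c"]:
    prior.add_parameter(key)  # defaults to a uniform prior on [0, 1]

def likelihood(param_dict):
    x = np.array([param_dict[key] for key in ["a", "b", "c"]])
    return -0.5 * np.sum((x - 0.5) ** 2) / 0.01

# More live points make missing a mode less likely; fixing the seed makes
# a run reproducible, so seed sensitivity can be tested directly.
sampler = Sampler(prior, likelihood, n_live=1600, seed=42)
sampler.run(verbose=True)
```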

Jammy2211 commented 1 year ago

It continued to infer local maxima for n_live=100, n_live=200 and n_live=400, but found the correct solution for n_live=1600.

This is a likelihood function that I've consistently found difficult to sample accurately (after many attempts with emcee, MultiNest and dynesty).

With my dynesty setup, it finds the solution in half as many iterations (100,000 versus 200,000). I'll see what happens with 800 live points.

Either way, the main problem is solved, thanks!