Thanks so much for reporting this! Do you have a checkpoint file you could send me? I suspect the points are distributed such that a minimum-volume enclosing ellipsoid has 0 volume to within machine precision.
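As a rough illustration of that hypothesis (a standalone numpy sketch, not nautilus's actual bound code): if the points are nearly confined to a lower-dimensional subspace, the shape matrix of any enclosing ellipsoid is singular to within double precision and the corresponding volume underflows to zero.

```python
# Illustrative only: points nearly confined to a hyperplane give an ellipsoid
# shape matrix whose determinant (~ squared volume) is zero at machine precision.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 5))
points[:, -1] = 1e-9 * rng.normal(size=500)  # one direction has almost no spread

cov = np.cov(points, rowvar=False)  # stand-in for an ellipsoid shape matrix
print(np.linalg.det(cov))           # ~1e-18: effectively zero volume
print(np.linalg.cond(cov))          # ~1e18: at the limit of double precision
```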
A link to a checkpoint file is here:
https://drive.google.com/file/d/16LU44iwQ_dckJjfn_5_NycWk6MgPV_GE/view?usp=sharing
In case it's important, the fit applies the following assertions:
parallel_trap_0.add_assertion(
parallel_trap_0.release_timescale < parallel_trap_1.release_timescale
)
parallel_trap_1.add_assertion(
parallel_trap_1.release_timescale < parallel_trap_2.release_timescale
)
Basically, a couple of parameters are not allowed to be above other parameters, and if they are, then I think that model is resampled or rejected somehow.
So, I would guess this could be creating an undesirable "edge" in parameter space.
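For what it's worth, the sketch below is only a hypothetical illustration of how such an assertion can behave like a hard edge; it is not PyAutoFit's actual resampling code, and the function and parameter names are made up.

```python
# Hypothetical illustration (not PyAutoFit internals): an ordering assertion
# acts like a hard boundary, because samples violating it are rejected
# (here: given -inf log-likelihood), so live points pile up against the edge
# release_timescale_0 < release_timescale_1 < release_timescale_2.
import numpy as np

def log_likelihood_with_ordering(timescales, log_likelihood):
    t0, t1, t2 = timescales
    if not (t0 < t1 < t2):
        return -np.inf  # rejected sample: creates a sharp edge in parameter space
    return log_likelihood(timescales)
```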
Thanks! Unfortunately, the sampler in the checkpoint file is at bound 91, not 111. Do you have the checkpoint file right before it crashed?
If I resume the run, I get this verbose output:
====
Starting job 6420109 at Sun Aug 20 19:33:35 BST 2023 for user dc-nigh1.
Running on nodes: m8036
====
2023-08-20 19:34:07,584 - autofit.non_linear.search.abstract_search - INFO - Creating non-linear search
2023-08-20 19:34:07,624 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,654 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,684 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,715 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,745 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,775 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,806 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,836 - autocti.charge_injection.model.analysis - INFO - PRELOADS - Noise Normalization preloaded for model-fit (noise-map is fixed).
2023-08-20 19:34:07,837 - search[1]_parallel[x3]_nautilus - INFO - The output path of this fit is /cosma8/data/dp195/dc-nigh1/autocti/output/uniform/parallel_x3/quad1/search[1]_parallel[x3]_nautilus/930b2672ce93a1fe70da2082535f2baf
2023-08-20 19:34:07,837 - search[1]_parallel[x3]_nautilus - INFO - Outputting pre-fit files (e.g. model.info, visualization).
2023-08-20 19:36:06,444 - search[1]_parallel[x3]_nautilus - INFO - Resuming Nautilus non-linear search (previous samples found).
2023-08-20 19:36:11,272 - search[1]_parallel[x3]_nautilus - INFO - Fit Still Running: Updating results after 5000 iterations (see output folder for latest visualization, samples, etc.)
#########################
### Exploration Phase ###
#########################
Adding Bound 91
And this error:
Traceback (most recent call last):
File "/cosma/home/dp004/dc-nigh1/cti_require/cosma/runners/uniform/parallel_x3.py", line 221, in <module>
result_1 = search_1.fit(model=model_1, analysis=analysis)
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 523, in fit
result = self.start_resume_fit(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 630, in start_resume_fit
self._fit(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 192, in _fit
sampler.run(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 417, in run
self.add_bound(verbose=verbose)
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/sampler.py", line 857, in add_bound
bound = NautilusBound.compute(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/neural.py", line 254, in compute
multi_ellipsoid = Union.compute(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/union.py", line 136, in compute
bound.bounds = [bound_class.compute(
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/nautilus/nautilus/bounds/basic.py", line 304, in compute
bound.B = np.linalg.cholesky(np.linalg.inv(bound.A))
File "<__array_function__ internals>", line 180, in cholesky
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 770, in cholesky
r = gufunc(a, signature=signature, extobj=extobj)
File "/cosma/home/dp004/dc-nigh1/PyAutoArctic2/lib/python3.10/site-packages/numpy/linalg/linalg.py", line 92, in _raise_linalgerror_nonposdef
raise LinAlgError("Matrix is not positive definite")
numpy.linalg.LinAlgError: Matrix is not positive definite
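For reference, the exception itself is easy to reproduce in isolation. The sketch below just demonstrates numpy's behaviour; the connection to `bound.A` (rounding errors pushing eigenvalues of the inverted shape matrix negative) is my reading of the traceback, not a confirmed diagnosis.

```python
# np.linalg.cholesky raises this exact error for any matrix that is not
# positive definite. Rounding errors when inverting a nearly singular shape
# matrix can produce tiny negative eigenvalues and trigger the same failure.
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])  # symmetric but indefinite (eigenvalues +3 and -1)
try:
    np.linalg.cholesky(A)
except np.linalg.LinAlgError as error:
    print(error)  # "Matrix is not positive definite"
```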
I had a few runs going, so the 121 bound may have been a different run, but this one at bound 91 is definitely at the point where the error comes up for one run!
Sorry, that should have been bound 111, not 121.
Interesting! Resuming from the file you sent me doesn't cause a crash for me. But let me look closer into this.
In case it's useful, here is the checkpoint file which crashed at bound 111:
https://drive.google.com/file/d/1eWq0iCnTh6nuCjZkiPh5H9cgX55-RVf8/view?usp=sharing
The issue has to do with numerical stability. The current implementation of finding a minimum-volume enclosing ellipsoid is a bit naive with respect to rounding errors. Can you try the stability branch and check whether this solves your issues? It should be more stable.
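One common way to harden this kind of computation, shown purely as a sketch (it is not necessarily what the stability branch actually does), is to avoid `cholesky(inv(A))` and instead go through an eigendecomposition with a floor on the eigenvalues:

```python
import numpy as np

def stable_inverse_factor(A, floor=1e-10):
    """Return B with B @ B.T ~= inv(A), robust to a nearly singular A."""
    A = 0.5 * (A + A.T)                                 # enforce exact symmetry
    eigval, eigvec = np.linalg.eigh(A)
    eigval = np.maximum(eigval, floor * eigval.max())   # clip tiny/negative modes
    return eigvec @ np.diag(eigval**-0.5) @ eigvec.T
```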
@Jammy2211 I merged the branch into main. Hopefully, this will solve the issue you're encountering. But I'll leave this open until you confirm.
It fixed the error and the analysis has converged; however, it converged on a solution that is incorrect (much lower likelihood than a dynesty run).
I am doing some runs with more live points (they take > 12 hours, so no results yet) and will let you know.
Is it conceivable that insufficient numerical stability would lead one to infer a local maximum?
It's not impossible that the previous numerical instability may have caused this. Is the maximum likelihood nautilus infers at a very different location than what dynesty found? In principle, though, I would definitely recommend running with significantly more than 100 live points.
> It's not impossible that the previous numerical instability may have caused this. Is the maximum likelihood nautilus infers at a very different location than what dynesty found?
Nautilus solution:
Bayesian Evidence -22466472.97699836
Maximum Log Likelihood -22466399.44634518
Summary (3.0 sigma limits):
cti
    serial_trap_list
        0
            density               0.22 (0.21, 0.22)
            release_timescale     2.07 (2.04, 2.09)
        1
            density               4.18 (4.12, 4.24)
            release_timescale     16.41 (16.29, 16.53)
        2
            density               2.81 (2.74, 2.88)
            release_timescale     32.87 (32.45, 33.25)
    serial_ccd
        well_fill_power           0.57 (0.57, 0.57)
        well_notch_depth          -44.38 (-47.78, -41.19)
Dynesty solution:
Bayesian Evidence -22409030.73498726
Maximum Log Likelihood -22408926.04040357
Summary (3.0 sigma limits):
cti
    serial_trap_list
        0
            density               0.07 (0.07, 0.08)
            release_timescale     0.81 (0.79, 0.83)
        1
            density               0.22 (0.21, 0.23)
            release_timescale     4.07 (3.91, 4.23)
        2
            density               6.55 (6.54, 6.55)
            release_timescale     20.02 (19.96, 20.07)
    serial_ccd
        well_fill_power           0.58 (0.58, 0.58)
        well_notch_depth          0.14 (-0.10, 0.44)
So, quite different.
> In principle, though, I would definitely recommend running with significantly more than 100 live points.
We are very much run-time limited, so my strategy has been to figure out what works with the fewest live points, as this seems to give the fastest run times. I've probably just overshot it on this occasion, but we will see!
I am rerunning with 200 live points, and will go above that in subsequent tests if the solutions continue to be incorrect.
The settings I'm using work beautifully for my strong lens analysis with models of this many parameters (that is a different problem), so to some extent this seems to be specific to the problem at hand.
Yes, it looks like nautilus got stuck in a local maximum here. That's always possible, as for any global optimizer. I don't think it has to do with the original numerical stability issue. More live points would help. At the same time, there's also a randomness to it: it may work with the same number of live points but a different random seed.
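As a hedged sketch of what that looks like in a standalone nautilus setup (keyword names per my reading of the nautilus API, so check the installed version; the prior and likelihood here are placeholders rather than the real PyAutoFit analysis):

```python
from nautilus import Prior, Sampler

prior = Prior()
prior.add_parameter("release_timescale", dist=(0.0, 50.0))  # placeholder prior

def likelihood(param_dict):
    return 0.0  # placeholder for the real CTI log-likelihood

# More live points and/or a different random seed, per the suggestion above.
sampler = Sampler(prior, likelihood, n_live=1600, seed=1)
sampler.run(verbose=True)
```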
It continued to infer local maxima for n_live=100, n_live=200 and n_live=400, but found the correct solution for n_live=1600.
This is a likelihood function that I've found to be consistently difficult to sample accurately (after many attempts with emcee, MultiNest and dynesty).
With my dynesty setup, it finds the solution with half as many iterations (100000 versus 200000). I'll see what happens with 800 live points.
Either way, the main problem is solved, thanks!
I just had a run crash with the following error:
This is the last bit of verbose output:
The analysis is probably quite hard to reproduce locally, as it uses an annoying-to-install C++ library wrapping a Python code base.
I suspect I can fix this by altering settings (e.g. more live points or bound settings), but I figured you'd want to be aware of a potential run-breaking error that can come up, in case there is a try/except fix.
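In the spirit of that suggestion, a hypothetical user-side guard might look like the following; it wraps the runner's existing `search_1.fit(...)` call (the names come from the traceback above) so a long run fails gracefully rather than dying mid-bound:

```python
import numpy as np

try:
    result_1 = search_1.fit(model=model_1, analysis=analysis)  # as in parallel_x3.py
except np.linalg.LinAlgError as error:
    # The checkpoint file on disk is kept, so the run can be resumed later.
    print(f"nautilus bound update failed: {error}")
```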