Thanks so much for running these tests. The numbers are definitely very surprising. I don't know your exact configuration, i.e., number of dimensions, difficulty of the likelihood etc., but I wouldn't expect the overhead to be that large. Could you share some example code so I can investigate this in detail?
I have made an example at the following repo:
https://github.com/Jammy2211/autolens_workspace_test/blob/main/naut/naut.py
I run this locally by going to the main repo folder `/mnt/c/Users/Jammy/Code/PyAuto/autolens_workspace_test` and running `python naut/naut.py 4`. Here, `4` is the number of CPUs, and the script is currently set up for multiprocessing (MPI can be turned on by commenting out a few bits of code at the end).
The `model` and `Fitness` objects that wrap the `prior` and `likelihood` inputs may seem a bit odd -- we have them set up as part of a library we developed for model-fitting. I am confident they function as expected in terms of wrapping `Nautilus`.
You will need to install our lens modeling software `autolens` in order to run the example: https://pyautolens.readthedocs.io/en/latest/installation/pip.html
This is the same data and model (and therefore likelihood function) as the runs I describe above, so it should be representative. Note that in order for it to run in `2:21:45.691793` overall (so just over 2 hours) the run was parallelized over 28 CPUs. Run times may therefore be quite long if you have fewer CPUs (e.g. testing the run on a laptop).
Let me know if there's an easier way I could set things up for you!
One thing worth noting, which could be an issue, is the following:
```python
@property
def resample_figure_of_merit(self):
    """
    If a sample raises a FitException, this value is returned to signify that the point requires
    resampling or should be given a likelihood so low that it is discarded.

    -np.inf is an invalid sample value for Nautilus, so we instead use a large negative number.
    """
    return -1.0e99
```
Basically, we have situations where the likelihood function returns `np.nan` or `np.inf`, or raises an exception. Currently, the way I deal with that is by returning a value of `-1.0e99`, as I couldn't see an alternative (e.g. asking `Nautilus` to discard the sample). I guess it's conceivable that these large negative values, which are typically only drawn at the beginning of sampling and significantly disrupt the smoothness of the likelihood surface, could be causing the neural networks issues.

Happy to do a test where I change this to a different behaviour, if you have a suggestion for what I should do.
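For concreteness, the behaviour described above amounts to something like the sketch below. This is a hypothetical illustration: `fit_log_likelihood` is a placeholder for the actual model-fit call, not PyAutoFit's real API.

```python
import numpy as np

RESAMPLE_FIGURE_OF_MERIT = -1.0e99  # used in place of -np.inf, which nautilus rejects


def log_likelihood(parameters):
    """Return a finite, very low log-likelihood whenever the fit fails,
    returns nan/inf, or raises, so nautilus never sees an invalid value."""
    try:
        value = fit_log_likelihood(parameters)  # hypothetical model-fit call
    except Exception:
        return RESAMPLE_FIGURE_OF_MERIT
    if not np.isfinite(value):
        return RESAMPLE_FIGURE_OF_MERIT
    return value
```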
Thanks! This is very helpful, and I'll look more into this soon. In the meantime, I have a suspicion of what the issue could be. Are you providing your own pool, or do you initialize the sampler with `Sampler(..., pool=28)`? In the latter case, the likelihood function will be pickled and passed to the worker for each likelihood evaluation. If your likelihood function is complex, this I/O could be the bottleneck. Dynesty provides its own pool to circumvent this issue, and I believe PyAutoFit is using that. I could make nautilus behave similarly automatically.
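For reference, the general workaround being described is the initializer pattern sketched below. This is a sketch of the technique, not the actual implementation in nautilus or dynesty: the expensive-to-pickle likelihood is shipped to each worker once at start-up, so only the small parameter arrays cross the process boundary per call.

```python
from multiprocessing import Pool

_likelihood = None  # set once in each worker process


def _init_worker(likelihood):
    # Runs once per worker: the heavy likelihood object is pickled a single time.
    global _likelihood
    _likelihood = likelihood


def _evaluate(parameters):
    # Per-call pickling is now limited to the parameter vector and the result.
    return _likelihood(parameters)


def make_pool(likelihood, processes=28):
    # A pool built like this can be passed to Sampler(..., pool=...) in place
    # of an integer.
    return Pool(processes=processes, initializer=_init_worker,
                initargs=(likelihood,))
```

To benefit from this, the function handed to the sampler would be the lightweight `_evaluate` wrapper, with the heavy likelihood object living only in the workers.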
I have been thinking for the past week that the problem may simply be that dynesty's parallelization scales better than nautilus's for my use case (e.g. LH function pickling)!
> Dynesty provides its own pool to circumvent this issue, and I believe PyAutoFit is using that. I could make nautilus behave similarly automatically.
Yes, when we updated to the new dynesty with this pool we saw a notable speed up owing to it reducing pool overheads.
I have done 2 parallelization tests with `nautilus`:

1. `Sampler(..., pool=28)`.
2. `Sampler(..., MPIPoolExecutor(28))`.

MPI gave much better performance than multiprocessing, but may still be slower than dynesty if the pickling overheads are a problem.
My problem has been that when profiling runs I don't have information on how long the neural network overheads take compared to the likelihood evaluations, so when there is a slow-down I cannot tell whether it is associated with neural network overheads or with inefficient parallelization.
However, I did some 1 CPU tests with serial runs of a lens model whose likelihood function evaluation time was ~0.15 seconds. I found that:

- `dynesty` converged in ~10 hours in 300000 samples.
- `Nautilus` runs did not converge after 36 hours (scripts were terminated due to the HPC limit being reached). Inspection of the `nautilus` output suggests it made ~200000 samples in 36 hours.

With a shorter LH function evaluation time the overheads will contribute more, but these numbers indicate that the majority of the 36 hours was not spent evaluating likelihoods (~200000 samples x 0.15 seconds is only about 8 hours), which seems surprising?
I think pickling overheads may be the culprit here. Can you try out the `pool` branch and benchmark the `Sampler(..., pool=28)` case again? In this case, the nautilus version in the `pool` branch will automatically initialize the workers to prevent expensive repeated pickling.
Can you tell me a bit more about the test with 1 CPU? What's the number of live points and dimensionality of the problem? Are any other initialization parameters of the sampler non-standard?
I will test the `pool` branch tonight; would pickling still be a culprit for MPI parallelization? And does the `pool` branch support MPI or only multiprocessing currently? (So I know which method to test.)
> Can you tell me a bit more about the test with 1 CPU? What's the number of live points and dimensionality of the problem? Are any other initialization parameters of the sampler non-standard?
Here are the `nautilus` settings:
```json
{
    "type": "autofit.non_linear.search.nest.nautilus.search.Nautilus",
    "arguments": {
        "name": "source_lp[1]_light[lp]_mass[total]_source[lp]",
        "number_of_cores": 1,
        "iterations_per_update": 5000000,
        "n_like_new_bound": null,
        "unique_tag": "slacs0008-0004",
        "mpi": false,
        "n_live": 200,
        "seed": null,
        "split_threshold": 100,
        "n_networks": 4,
        "n_points_min": null,
        "n_eff": 500,
        "n_shell": null,
        "n_update": null,
        "path_prefix": {
            "type": "pathlib.PosixPath",
            "arguments": {}
        },
        "enlarge_per_dim": 1.1
    }
}
```
The model had N=23 parameters; I don't think there is anything "non-standard" except the `-1.0e99` thing I described above. It is a fairly standard model one would use for lens modeling.
I will also perform a 1 CPU run and put in timer statements on how long the neural network training is taking (provided I can figure out the `nautilus` source code to do this; it looks fairly self-explanatory).
I will also try and extend this to the 28 CPU runs.
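For what it's worth, one way to get that split without touching the `nautilus` source is to time the likelihood side and subtract it from the wall-clock time. A rough sketch for a serial run; `log_likelihood` here is a placeholder for the actual function passed to the sampler:

```python
import time

lh_time = 0.0   # accumulated time spent inside the likelihood
lh_calls = 0    # number of likelihood evaluations


def timed_log_likelihood(parameters):
    """Time each likelihood call; after the run, wall-clock time minus
    lh_time approximates the total sampler (e.g. neural network) overhead."""
    global lh_time, lh_calls
    start = time.perf_counter()
    value = log_likelihood(parameters)  # placeholder for the real likelihood
    lh_time += time.perf_counter() - start
    lh_calls += 1
    return value
```

Passing `timed_log_likelihood` to the sampler in place of the original function (serial runs only, since the globals live in a single process) gives the breakdown directly.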
The changes in the `pool` branch wouldn't apply to the MPI version in the way you posted it here. If the user provides their own pool (`MPIPoolExecutor(28)` in this case), then they need to ensure it's initialized properly.
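In case it helps, recent versions of `mpi4py` (3.1+, if I recall correctly) accept `initializer`/`initargs` for `MPIPoolExecutor`, so the same once-per-worker initialization sketched earlier can be done by hand for MPI. This is an assumed set-up, not something nautilus does for you; the toy `likelihood` is a placeholder:

```python
from mpi4py.futures import MPIPoolExecutor

_likelihood = None  # set once on each MPI worker


def _init_worker(likelihood):
    # Runs once per MPI worker, so the likelihood is transferred only once.
    global _likelihood
    _likelihood = likelihood


def _evaluate(parameters):
    return _likelihood(parameters)


def likelihood(parameters):
    # Placeholder for the real (expensive-to-pickle) likelihood object.
    return -0.5 * sum(p * p for p in parameters)


def make_mpi_pool(workers=28):
    # Typically launched via mpiexec; see the mpi4py.futures docs for the
    # exact invocation on your cluster.
    return MPIPoolExecutor(max_workers=workers, initializer=_init_worker,
                           initargs=(likelihood,))
```

As before, the function handed to `Sampler` would be the lightweight `_evaluate`, with the heavy likelihood living only in the workers.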
More generally, overheads of multiple hours seem excessive for problems with low dimensionality (~25 dimensions or less). Neural network training or any other part of the algorithm other than the likelihood evaluations shouldn't take that long.
I performed some tests and think pickling is most likely the issue for the original benchmarks posted here. Using 4 CPU cores and the `pool` branch, likelihood evaluations (including the overhead for pickling) are around two times faster. The improvement for 28 cores is probably the same or larger. So that would make nautilus and dynesty take roughly the same time per likelihood call.
I'll probably integrate the `pool` branch into `main`, eventually.
Yep, that did it, the performance is incredibly fast now!
I'll leave this open for now, as I may have a few follow-up questions about `pool`, but want to get some profiling tests run first.
Wonderful! Also, note that nautilus computes likelihood values in batches (the `n_batch` keyword argument). For optimal performance, the batch size should be a multiple of the number of CPU cores or MPI processes, e.g., 4 * 28 = 112 if using 28 CPU cores.
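For concreteness, a minimal sketch of that set-up, with a toy prior and likelihood standing in for the actual lens model (only `pool`, `n_live` and `n_batch` are taken from this thread; the rest is illustrative):

```python
from nautilus import Prior, Sampler

# Toy two-parameter prior and likelihood, purely for illustration.
prior = Prior()
prior.add_parameter("x", dist=(-5.0, 5.0))
prior.add_parameter("y", dist=(-5.0, 5.0))


def likelihood(param_dict):
    return -0.5 * (param_dict["x"] ** 2 + param_dict["y"] ** 2)


# 28 worker processes; batch size a multiple of the core count (4 * 28 = 112).
sampler = Sampler(prior, likelihood, n_live=200, pool=28, n_batch=112)
sampler.run()
```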
Yep, I already have `n_batch` set in that way, and saw a small speed-up compared to other runs.
My final question is about whether you think it is feasible to get more efficient parallelization and how.
In my runs, I define a "speed-up factor", which is basically the average likelihood evaluation time multiplied by the total number of iterations, divided by the actual wall-clock time. I am running on 56 CPUs (one node on the HPC), so if parallelization were 100% efficient I would see a speed-up factor of 56.
The speed-up factor varies between fits: the highest values are ~40 and the lowest are ~15. This appears to scale with the amount of extra data I store in my likelihood function (and therefore the amount of extra pickling necessary). However, I have yet to test this explicitly and will do so in the next few days.
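For reference, the quantity being computed is simply the following (the numbers are placeholders chosen to roughly match the runs discussed here):

```python
mean_lh_time = 3.0        # average likelihood evaluation time [s]
n_iterations = 100_000    # total number of likelihood evaluations
wall_clock = 3.0 * 3600   # measured wall-clock time [s]

speed_up = mean_lh_time * n_iterations / wall_clock
print(f"speed-up factor: {speed_up:.1f} (ideal: 56)")  # ~27.8 with these numbers
```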
My questions are:
1) Do you have any advice for achieving a better speed-up factor via `multiprocessing`? Are there more customized ways I can set up the pickling, or can I store the additional data in my likelihood function in a better format? Or is this a fundamental limitation of Python `multiprocessing`?
2) Do you expect that a properly initialized MPI implementation would see better performance, and do you know of any examples online of how to implement MPI in this way? (I've never found a good resource explaining this in the context of model-fitting via a sampler!)
Nevertheless, `nautilus` is now giving us a ~x3 speed-up over `dynesty` and appears to be giving more robust sampling performance as well, so I am pretty satisfied already, I must say! :)
With the current implementation in the `main` branch and calling via `Sampler(..., pool=28)`, the overhead from pickling should be basically negligible. I don't know why the speed-up factors are so low sometimes. I don't think the overhead from nautilus should be the reason, but I can't 100% rule this out. There could also be other reasons. For example, if the likelihood computation time is highly variable, this could lead to 55 processors often waiting for the result of one computation that just happens to take longer. Is this still for the example discussed earlier? I'd be happy to explore this further. This should hold for a properly initialized `MPIPoolExecutor`, too.

> With the current implementation in the main branch and calling via Sampler(..., pool=28), the overhead from pickling should be basically negligible.
I agree. I did some reading / tests last night and didn't realise pickling occurs only once before the fit starts. So yes, it is negligible.
> I don't know why the speed-up factors are so low sometimes. I don't think the overhead from nautilus should be the reason. But I can't 100% rule this out.
The speed-up factors I get are roughly the same for nautilus and dynesty, so I think it is more general behaviour of `multiprocessing` that I do not yet understand.
> For example, if the likelihood computation time is highly variable, this could lead to 55 processors often waiting for the result of one computation that just happens to take longer.
I am considering this; in general our LH function is stable over time, but there may be some unexpected behavior I have not considered. I will do some tests to rule it out.
> Is this still for the example discussed earlier? I'd be happy to explore this further.
For the example above the speed-up factors are more like 30 over 56 CPUs, so it seems to scale a lot better. It is a slightly different lens model that has the worst performance, which may point to LH function time variations. I'll dig a bit deeper first.
After some profiling, our slow-down is indeed primarily due to variations in the log likelihood evaluation time:

The run above gives a speed-up of ~28 for 56 CPUs, and shows LH evaluation times that span 2-4 seconds.
Any suggestions? I assume a `nautilus` asynchronous mapping feature is... out of scope?
Thanks for the profiling! Yes, these differences may contribute to the problem of lower-than-expected speed-ups. With the current version of nautilus, my main suggestion would be to increase the batch size, `n_batch`. As you increase the batch size, these variations should average out to some degree. However, increasing the batch size also tends to increase the number of points in every bound/shell since points are always added in batches. Check how many points are added between creating new bounds; that's roughly how large your batch size can be without strongly affecting how nautilus runs otherwise. (Going larger should not cause any bias, but it may result in nautilus needing more samples overall.)
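To illustrate the averaging effect, here is a toy calculation. It is not nautilus's actual scheduling: it just assumes a batch is split evenly across cores and finishes with the slowest core, with per-call times drawn uniformly from 2-4 s as in the profiling above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cores = 56
n_trials = 1000

for n_batch in (56, 112, 224, 448):
    wall = work = 0.0
    for _ in range(n_trials):
        times = rng.uniform(2.0, 4.0, n_batch)              # per-call times [s]
        per_core = times.reshape(n_cores, -1).sum(axis=1)   # work on each core
        wall += per_core.max()                              # batch wall-clock time
        work += times.sum()                                 # total likelihood time
    print(f"n_batch={n_batch}: effective speed-up ~ {work / wall:.0f} of {n_cores}")
```

In this toy model, larger batches push the effective speed-up closer to the core count, consistent with the suggestion above.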
And yes, asynchronous mapping is possible, in general. But it's much more challenging to implement. Also, it would break reproducibility when providing a random seed, since the points sampled would then depend on execution times, which naturally vary.
Thanks, we will experiment with `n_batch` and also assess if we can homogenize our LH function run-time.
Would asynchronous mapping require an upheaval of the `nautilus` code in a significant way, or do you think it would just require someone with knowledge of parallelization to adapt the `map` function that is currently there? I may be able to put someone with some knowledge on the problem if it sounds feasible.
Thanks for the offer! Sadly, the issue is solely related to how to implement this in the nautilus algorithm, not how to perform the parallelization itself. Also, breaking seeding is something I'm not currently considering. Ultimately, such an asynchronous mapping feature is currently out of scope.
I'm closing this issue here since the original issue had to do with I/O limitations when performing parallelization. That was addressed with PR #25 and by updating the docs.
I have been applying `Nautilus` to strong lens modeling (https://github.com/Jammy2211/PyAutoLens), a problem for which I have previously found that dynesty (using the random walk sampler) can give very good performance in terms of efficiency.

My first trial runs with `Nautilus` were promising -- it could converge on the same solution as `dynesty` with ~x2.5 fewer samples (I suspect x3.0 or better will be possible with tuned `nautilus` settings and more complex models than my test cases).

However, I have struggled to make `Nautilus`'s wall-clock time go below that of `dynesty`'s, even when the number of iterations is reduced.

Here are statistics from two examples of the same strong lens model-fit parallelized over N=28 CPUs:

Dynesty:

Nautilus:

`dynesty` converges in the same amount of time as `Nautilus`, even though it takes over 2.5x more samples (161938 vs 68700). I believe this is because of the overheads `Nautilus` has associated with the neural network. I assumed these overheads would be negligible for a likelihood function which takes over 1 second to compute, but that does not seem to be the case. I see similar behavior for even longer likelihood functions (3+ seconds).
My questions currently are:
- It surprises me that the NN overheads would be so significant for likelihood function evaluation times as long as 3 seconds. Is it possible that the parallelization of the likelihood function is significantly more efficient than the parallelization of the neural network sample generator? I am parallelizing over N=28 CPUs, and this would help explain why, even for long likelihood function evaluation times, the overheads still seem to drive the run-time.
- If I am in an "overhead dominated" regime, do you have any suggestions for settings changes? Reducing `n_batch` appears to have helped and I am now experimenting with parameters like `n_networks`.
- I am using a low `n_live` of 200 (which, based on my experiments, led to convergence with the fewest iterations). Would increasing `n_live` reduce the number of neural network calls?