johannesulf / nautilus

Neural Network-Boosted Importance Nested Sampling for Bayesian Statistics
https://nautilus-sampler.readthedocs.io
MIT License

Questions Regarding Sampler Performance #24

Closed Jammy2211 closed 1 year ago

Jammy2211 commented 1 year ago

I have been applying Nautilus to strong lens modeling (https://github.com/Jammy2211/PyAutoLens), a problem for which I have previously found that dynesty (using the random walk sampler) gives very good performance in terms of efficiency.

My first trial runs with Nautilus were promising -- it could converge on the same solution as dynesty with ~2.5x fewer samples (I suspect 3.0x or better will be possible with tuned nautilus settings and more complex models than my test cases).

However, I have struggled to make Nautilus's wallclock time go below that of dynesty's, even when the number of iterations is reduced.

Here are statistics from two examples of the same strong lens model-fit parallelized over N=28 CPUs:

Dynesty:

Total Samples = 161938
Total Accepted Samples = 15354
Acceptance Ratio = 0.0948140646420235
Time To Run = 2:19:02.749062
Time Per Sample (seconds) = 0.05151816782709724
Log Likelihood Function Evaluation Time (seconds) = 1.0385229587554932

Nautilus:

Total Samples = 68700
Total Accepted Samples = 68700
Acceptance Ratio = 1.0
Time To Run = 2:21:45.691793
Time Per Sample (seconds) = 0.1238091964110884
Log Likelihood Function Evaluation Time (seconds) = 1.1223814487457275

dynesty converges in the same amount of time as Nautilus, even though it takes roughly 2.4x as many samples (161938 vs 68700).

I believe this is because of the overheads Nautilus has associated with the neural network.

I assumed these overheads would be negligible for a likelihood function that takes over 1 second to compute, but this does not seem to be the case. I see similar behavior for even longer likelihood functions (3+ seconds).

My questions currently are:

johannesulf commented 1 year ago

Thanks so much for running these tests. The numbers are definitely very surprising. I don't know your exact configuration, i.e., number of dimensions, difficulty of the likelihood etc., but I wouldn't expect the overhead to be that large. Could you share some example code so I can investigate this in detail?

Jammy2211 commented 1 year ago

I have made an example at the following repo:

https://github.com/Jammy2211/autolens_workspace_test/blob/main/naut/naut.py

I run this locally by going to the main repo folder /mnt/c/Users/Jammy/Code/PyAuto/autolens_workspace_test and running python naut/naut.py 4.

Here, 4 is the number of CPUs and the script is currently set up for multiprocessing (MPI can be turned on by commenting out a few bits of code at the end).

The model and Fitness objects that wrap the prior and likelihood inputs may seem a bit odd -- we have them set up as part of a library we developed for model-fitting. I am confident they function as expected in terms of wrapping Nautilus.

You will need to install our lens modeling software autolens in order to run the example: https://pyautolens.readthedocs.io/en/latest/installation/pip.html

This is the same data and model (and therefore likelihood function) as the runs I describe above so should be representative. Note that in order for it to run in 2:21:45.691793 overall (so just over 2 hours) the run was parallelized over 28 CPUs. Run times may therefore be quite long if you have fewer CPUs (e.g. testing the run on a laptop).

Let me know if there's an easier way I could set things up for you!

Jammy2211 commented 1 year ago

One thing worth noting, that could be an issue, is the following:

    @property
    def resample_figure_of_merit(self):
        """
        If a sample raises a FitException, this value is returned to signify that the point requires resampling or
        should be given a likelihood so low that it is discarded.

        -np.inf is an invalid sample value for Nautilus, so we instead use a large negative number.
        """
        return -1.0e99

Basically, we have situations where the likelihood function returns an np.nan, np.inf or raises an exception.

Currently the way I deal with that is by returning a value of -1.0e99, as I couldn't see an alternative (e.g. asking Nautilus to discard the sample). I guess it's conceivable that these large negative values, which are typically only drawn at the beginning of sampling and significantly break the smoothness of the likelihood surface, could be causing the neural networks to have issues.
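
For reference, here is a stripped-down sketch of the kind of wrapping we do (the function names are illustrative, not our actual Fitness class):

    import numpy as np

    RESAMPLE_FIGURE_OF_MERIT = -1.0e99  # stands in for -np.inf, which Nautilus rejects

    def raw_log_likelihood(parameters):
        # Toy stand-in for the real lens-model likelihood; it may raise or
        # return a non-finite value for some parameter combinations.
        return -0.5 * float(np.sum(np.asarray(parameters) ** 2))

    def wrapped_log_likelihood(parameters):
        # Map exceptions and non-finite values to a very low (but finite)
        # likelihood so the point is effectively discarded by the sampler.
        try:
            value = raw_log_likelihood(parameters)
        except Exception:
            return RESAMPLE_FIGURE_OF_MERIT
        return value if np.isfinite(value) else RESAMPLE_FIGURE_OF_MERIT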

Happy to do a test where I change this to a different behaviour, if you have a suggestion for what I should do.

johannesulf commented 1 year ago

Thanks! This is very helpful, and I'll look more into this soon. In the meantime, I have a suspicion of what the issue could be. Are you providing your own pool, or do you initialize the sampler with Sampler(..., pool=28)? In the latter case, the likelihood function will be pickled and passed to a worker for each likelihood evaluation. If your likelihood function is complex, this I/O could be the bottleneck. Dynesty provides its own pool to circumvent this issue, and I believe PyAutoFit is using that. I could make nautilus behave similarly automatically.
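
For reference, here is a minimal sketch of the two ways of passing a pool, with a toy prior and likelihood standing in for the real lens model:

    import numpy as np
    from multiprocessing import Pool
    from nautilus import Sampler

    def prior(u):
        # Map the unit hypercube to the parameter space (toy flat priors on [-5, 5]).
        return 10.0 * u - 5.0

    def likelihood(x):
        # Toy Gaussian stand-in for an expensive likelihood.
        return -0.5 * float(np.sum(x**2))

    if __name__ == "__main__":
        # Option A: pass an integer and let nautilus create the pool itself.
        sampler = Sampler(prior, likelihood, n_dim=5, pool=28)
        sampler.run()

        # Option B: pass a pool object you create and manage yourself.
        with Pool(28) as pool:
            sampler = Sampler(prior, likelihood, n_dim=5, pool=pool)
            sampler.run()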

Jammy2211 commented 1 year ago

I have been thinking for the past week that the problem may simply be that dynesty's parallelization scales better than nautilus's for my use case (e.g. LH function pickling)!

Dynesty provides its own pool to circumvent this issue, and I believe PyAutoFit is using that. I could make nautilus behave similarly automatically.

Yes, when we updated to the new dynesty with this pool we saw a notable speed up owing to it reducing pool overheads.

I have done two parallelization tests with nautilus, one using multiprocessing and one using MPI.

MPI gave much better performance than multiprocessing, but may still be slower than dynesty if the pickling overheads are a problem.

My problem has been that when profiling runs I don't have information on how long the neural network overheads take compared to the likelihood evaluations, so when there is a slowdown I cannot tell whether it is caused by neural network overheads or by inefficient parallelization.

Jammy2211 commented 1 year ago

However, I did some serial 1-CPU tests with a lens model whose likelihood function evaluation time was ~0.15 seconds.

I found that:

With a shorter LH function evaluation time the overheads will contribute more, but these numbers indicate that the majority of the 36 hours was not spent evaluating likelihoods, which seems unlikely?

johannesulf commented 1 year ago

I think pickling overheads may be the culprit here. Can you try out the pool branch and benchmark the Sampler(..., pool=28) case again? In this case, the nautilus version in the pool branch will automatically initialize the workers to prevent expensive repeated pickling.
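
In plain multiprocessing terms, the idea is roughly the following (a sketch of the general mechanism, not the exact nautilus code): the expensive likelihood object is pickled once per worker at start-up, and only a small module-level function is pickled with each map call.

    import multiprocessing as mp

    _worker_likelihood = None  # set once in each worker process

    def _initialize_worker(likelihood):
        # Runs once per worker: the (potentially large) likelihood object is
        # pickled a single time here instead of once per likelihood call.
        global _worker_likelihood
        _worker_likelihood = likelihood

    def _evaluate(x):
        # Cheap to pickle: only this small function travels with each map() call.
        return _worker_likelihood(x)

    def make_pool(likelihood, n_workers):
        return mp.Pool(n_workers, initializer=_initialize_worker,
                       initargs=(likelihood,))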

Can you tell me a bit more about the test with 1 CPU? What's the number of live points and dimensionality of the problem? Are any other initialization parameters of the sampler non-standard?

Jammy2211 commented 1 year ago

I will test the pool branch tonight. Would pickling still be a culprit for MPI parallelization? And does the pool branch support MPI or only multiprocessing currently? (So I know which method to test.)

Jammy2211 commented 1 year ago

Can you tell me a bit more about the test with 1 CPU? What's the number of live points and dimensionality of the problem? Are any other initialization parameters of the sampler non-standard?

Here are the nautilus settings:

{
    "type": "autofit.non_linear.search.nest.nautilus.search.Nautilus",
    "arguments": {
        "name": "source_lp[1]_light[lp]_mass[total]_source[lp]",
        "number_of_cores": 1,
        "iterations_per_update": 5000000,
        "n_like_new_bound": null,
        "unique_tag": "slacs0008-0004",
        "mpi": false,
        "n_live": 200,
        "seed": null,
        "split_threshold": 100,
        "n_networks": 4,
        "n_points_min": null,
        "n_eff": 500,
        "n_shell": null,
        "n_update": null,
        "path_prefix": {
            "type": "pathlib.PosixPath",
            "arguments": {}
        },
        "enlarge_per_dim": 1.1
    }
}

The model had N=23 parameters. I don't think there is anything "non-standard" except the -1.0e99 thing I described above. It is a fairly standard model one would use for lens modeling.

Jammy2211 commented 1 year ago

I will also perform a 1-CPU run and put in timer statements to measure how long the neural network training takes (provided I can figure out the nautilus source code to do this; it looks fairly self-explanatory).

I will also try to extend this to the 28-CPU runs.

johannesulf commented 1 year ago

The changes in the pool branch wouldn't apply to the MPI version in the way you posted it here. If the user provides their own pool (MPIPoolExecutor(28) in this case), then they need to ensure it's initialized properly.
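
If you want to keep using MPI, the same initializer trick can be applied on your side, assuming your mpi4py version supports the initializer/initargs arguments of MPIPoolExecutor. A rough, self-contained sketch with toy data and likelihood:

    import numpy as np
    from mpi4py.futures import MPIPoolExecutor
    from nautilus import Sampler

    _data = None  # heavy per-worker state, set once by the initializer

    def _initialize_worker(data):
        global _data
        _data = data  # pickled once per worker, not once per likelihood call

    def prior(u):
        return 10.0 * u - 5.0  # toy flat priors on [-5, 5]

    def likelihood(x):
        # Uses the worker-local copy of the data; only this small function
        # is sent with each likelihood call.
        return -0.5 * float(np.sum((x - _data[:len(x)]) ** 2))

    if __name__ == "__main__":
        data = np.zeros(1_000_000)  # stand-in for a large dataset
        with MPIPoolExecutor(28, initializer=_initialize_worker,
                             initargs=(data,)) as pool:
            sampler = Sampler(prior, likelihood, n_dim=5, pool=pool)
            sampler.run()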

More generally, overheads of multiple hours seem excessive for problems with low dimensionality (~25 dimensions or less). Neural network training or any other part of the algorithm other than the likelihood evaluations shouldn't take that long.

johannesulf commented 1 year ago

I performed some tests and think pickling is most likely the issue for the original benchmarks posted here. Using 4 CPU cores and the pool branch, likelihood evaluations (including the overhead for pickling) are around two times faster. The improvement for 28 cores is probably the same or larger. So that would make nautilus and dynesty take roughly the same time per likelihood call.

I'll probably integrate the pool branch into main, eventually.

Jammy2211 commented 1 year ago

Yep, that did it, the performance is incredibly fast now!

I'll leave this open for now, as I may have a few follow-up questions about pool, but I want to get some profiling tests run first.

johannesulf commented 1 year ago

Wonderful! Also, note that nautilus computes likelihood values in batches (n_batch keyword argument). For optimal performance, the batch size should be a multiple of the number of CPU cores or MPI processes, e.g., 4 * 28 = 112 if using 28 CPU cores.
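
For example, something like this (toy prior and likelihood, 28 cores as in your benchmark):

    import numpy as np
    from nautilus import Sampler

    def prior(u):
        return 10.0 * u - 5.0  # toy flat priors

    def likelihood(x):
        return -0.5 * float(np.sum(x**2))  # toy stand-in likelihood

    if __name__ == "__main__":
        n_cores = 28
        sampler = Sampler(prior, likelihood, n_dim=23, pool=n_cores,
                          n_batch=4 * n_cores)  # 112, a multiple of the core count
        sampler.run()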

Jammy2211 commented 1 year ago

Yep, I already have n_batch set that way, and saw a small speed-up compared to other runs.

My final question is about whether you think it is feasible to get more efficient parallelization and how.

In my runs, I define a "speed up factor", which is basically the average likelihood evaluation time multiplied by the total number of iterations divided by the actual wall clock time. I am running on 56 CPUs (one node on the HPC), so if parallelization were 100% efficient I would see a speed up factor of 56.
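
In code, the factor I compute is essentially the following (the example numbers are taken from the 28-CPU nautilus benchmark in my first post):

    def speed_up_factor(mean_like_time_s, n_like_evals, wall_clock_s):
        # (average likelihood evaluation time * number of evaluations) / wall time
        return mean_like_time_s * n_like_evals / wall_clock_s

    # ~1.12 s per evaluation, 68700 evaluations, 2:21:45 of wall time -> ~9 out of 28.
    print(speed_up_factor(1.12, 68700, 2 * 3600 + 21 * 60 + 46))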

The speed-up factor varies between fits, with the highest values around ~40 and the lowest around ~15. This appears to scale with the amount of extra data I store in my likelihood function (and therefore the amount of extra pickling necessary). However, I have yet to explicitly test this and will do so in the next few days.

My questions are:

1) Do you have any advice for achieving a better speed-up factor via multiprocessing? Are there more customized ways I can set up the pickling, or can I store the additional data in my likelihood function in a better format? Or is this a fundamental limitation of Python multiprocessing?

2) Do you expect that a properly initialized MPI implementation would see better performance, and do you know of any examples online of how to implement MPI in this way? (I've never found a good resource explaining this in the context of model-fitting via a sampler!)

Nevertheless, nautilus is now giving us a ~3x speed-up over dynesty and appears to be giving more robust sampling performance as well, so I am already pretty satisfied, I must say! :)

johannesulf commented 1 year ago

1. With the current implementation in the main branch and calling via Sampler(..., pool=28), the overhead from pickling should be basically negligible. I don't know why the speed-up factors are so low sometimes. I don't think the overhead from nautilus should be the reason. But I can't 100% rule this out. There could also be other reasons. For example, if the likelihood computation time is highly variable, this could lead to 55 processors often waiting for the result of one computation that just happens to take longer. Is this still for the example discussed earlier? I'd be happy to explore this further.

2. I don't think MPI would give better performance than multiprocessing (assuming you ran on one node). They should perform the same if both are appropriately initialized. You can have a look at how this is currently implemented in nautilus (https://github.com/johannesulf/nautilus/blob/main/nautilus/sampler.py#L309C27-L310C65) to get some idea of one possible approach. This should work with MPIPoolExecutor, too.

Jammy2211 commented 1 year ago

With the current implementation in the main branch and calling via Sampler(..., pool=28), the overhead from pickling should be basically negligible.

I agree. I did some reading and tests last night and didn't realise pickling occurs only once before the fit starts. So yes, it is negligible.

I don't know why the speed-up factors are so low sometimes. I don't think the overhead from nautilus should be the reason. But I can't 100% rule this out.

The speed-up factors I get are roughly the same for nautilus and dynesty, so I think it is more general behaviour of multiprocessing that I do not yet understand.

For example, if the likelihood computation time is highly variable, this could lead to 55 processors often waiting for the result of one computation that just happens to take longer. Is this still for the example discussed earlier? I'd be happy to explore this further.

I am considering this; in general our LH function is stable over time, but there may be some unexpected behavior I have not considered. I will do some tests to rule this out.

Is this still for the example discussed earlier? I'd be happy to explore this further.

For the example above the speed-up factor is more like 30 over 56 CPUs, so it seems to scale a lot better. It is a slightly different lens model that has the worst performance, which may point to LH function time variations. I'll dig a bit deeper first.

Jammy2211 commented 1 year ago

After some profiling, our slowdown is indeed primarily due to variations in the log likelihood evaluation time:

[Profiling plot: per-sample log likelihood evaluation times, spanning roughly 2-4 seconds]

The run above gives a speed-up of ~28 for 56 CPUs, and shows LH evaluation times that span 2-4 seconds.

Any suggestions? I assume a nautilus asynchronous mapping feature is... out of scope?

johannesulf commented 1 year ago

Thanks for the profiling! Yes, these differences may contribute to the problem of lower-than-expected speed-ups. With the current version of nautilus, my main suggestion would be to increase the batch size, n_batch. As you increase the batch size, these variations should average out to some degree. However, increasing the batch size also tends to increase the number of points in every bound/shell, since points are always added in batches. Check how many points are added between creating new bounds; that's roughly how large your batch size can be without strongly affecting how nautilus runs otherwise. (Going larger should not cause any bias, but it may result in nautilus needing more samples overall.)
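
To illustrate the averaging effect with assumed numbers (per-call times uniform between 2 and 4 seconds, 56 workers; not your actual timings): each batch finishes only when the slowest worker does, and larger batches smooth out the per-call scatter.

    import numpy as np

    rng = np.random.default_rng(0)
    n_workers = 56

    for n_batch in (56, 112, 448):  # multiples of the worker count in this toy model
        speedups = []
        for _ in range(1000):
            times = rng.uniform(2.0, 4.0, size=n_batch)
            # Each worker evaluates n_batch / n_workers likelihoods in sequence;
            # the batch wall time is set by the slowest worker.
            per_worker = times.reshape(n_workers, -1).sum(axis=1)
            speedups.append(times.sum() / per_worker.max())
        print(f"n_batch={n_batch}: mean effective speed-up {np.mean(speedups):.1f}")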

And yes, asynchronous mapping is possible in general, but it's much more challenging to implement. Also, it would break reproducibility when providing a random seed, since the points sampled would then depend on execution times, which naturally vary.

Jammy2211 commented 1 year ago

Thanks, we will experiment with n_batch and also assess if we can homogenize our LH function run-time.

Would asynchronous mapping require a significant upheaval of the nautilus code, or do you think it would just require someone with knowledge of parallelization to adapt the map function that is currently there? I may be able to put someone with the relevant knowledge on the problem if it sounds feasible.

johannesulf commented 1 year ago

Thanks for the offer! Sadly, the issue is solely related to how to implement this in the nautilus algorithm, not how to perform the parallelization itself. Also, breaking seeding is something I'm not currently considering. Ultimately, such an asynchronous mapping feature is currently out of scope.

I'm closing this issue since the original problem had to do with I/O limitations when performing parallelization. That was addressed with PR #25 and by updating the docs.