sbi-benchmark / sbibm

Simulation-based inference benchmark
https://sbi-benchmark.github.io
MIT License

Using `diffeqtorch` with `julia` for `lotka_volterra` on `slurm` ? #62

Closed · JuliaLinhart closed this issue 1 year ago

JuliaLinhart commented 1 year ago

Hello! I am trying to use your benchmarking framework for some experiments and wanted to include the lotka_volterra example. I've managed to install julia and diffeqtorch as specified in your README.md; however, this becomes a bit of a pain when I want to run my experiments on slurm.

I am using submitit to run jobs on slurm. Every time I want to run an experiment for lotka_volterra, I have to recreate an image to use diffeqtorch with julia within the slurm_setup as follows:

```python
executor = submitit.AutoExecutor(job_name)
executor.update_parameters(
    timeout_min=180,
    slurm_job_name=job_name,
    slurm_time=f"{timeout_hour}:00:00",
    slurm_additional_parameters={
        "ntasks": 1,
        "cpus-per-task": n_cpus,
        "distribution": "block:block",
    },
    slurm_setup=[
        "module purge",
        "export JULIA_SYSIMAGE_DIFFEQTORCH='$HOME/.julia_sysimage_diffeqtorch.so'",
        "python -c 'from diffeqtorch.install import install_and_test; install_and_test()'",
    ],
)
```

This takes forever! I was wondering if you use slurm for your experiments and, if so, how?

More generally, sampling from the posterior is very time-consuming. How do you handle this constraint? (I am talking about sampling for lotka_volterra, but also for the slcp posteriors...)

jan-matthis commented 1 year ago

Hi! We didn't use Hydra's Submitit Launcher but rather the RQ Launcher that I originally wrote for this purpose, so no direct experience with this. Is each worker building the sysimage from scratch? If so, perhaps you can do the build just once and distribute it to your workers.
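As a sketch of the build-once route (path and env-var name taken from the README; adapt as needed): run the build a single time on the login node, copy the resulting `.so` somewhere every worker can read, and then the per-job setup shrinks to the export alone:

```python
# Sketch: the sysimage is built once, outside the jobs, e.g. via
#   python -c 'from diffeqtorch.install import install_and_test; install_and_test()'
# After copying the .so to a path visible to all workers, each job only
# needs to point diffeqtorch at it -- no rebuild step in slurm_setup.
SLURM_SETUP = [
    "module purge",
    # Double quotes so the shell expands $HOME on the worker.
    'export JULIA_SYSIMAGE_DIFFEQTORCH="$HOME/.julia_sysimage_diffeqtorch.so"',
]
```

The one-off build is the same `install_and_test()` call you already run, just moved out of the job setup.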

How long does sampling take? I'd expect things to very roughly line up with the runtimes published as part of the benchmark.

JuliaLinhart commented 1 year ago

Oh nice, okay! I'll have a look at that! And yes, I'm distributing the built image across workers, but it still needs to be built every time I launch a new experiment.

And for the posterior sampling, I think it's way faster for you guys! I'll have to take a closer look at your code!

Anyway, thanks so much for the response! I'll let you know what I end up doing!

JuliaLinhart commented 1 year ago

If I understood correctly, you are loading precomputed samples from the reference posterior here, and the runtime as defined here corresponds to the runtime of the inference algorithm (as computed here).

However, I was talking about the runtime of sampling the reference posterior itself. I need to sample from the reference posterior several times for empirical results, which means I cannot just load your precomputed samples. Do you have a solution for faster reference posterior sampling?

jan-matthis commented 1 year ago

> And yes, I'm distributing the built image across workers, but it still needs to be built every time I launch a new experiment.

Hm, if the image is on the worker and gets detected properly, a build shouldn't be needed. Unfortunately, I'm not sure what's going wrong here.
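One thing worth checking: in your setup, `$HOME` sits inside single quotes, so the exported variable contains the literal string `$HOME/...` rather than an expanded path; if diffeqtorch checks for the file without expanding variables, the existing sysimage would never be detected. A sketch of a setup that uses double quotes and only rebuilds when the file is really missing (same path and install call as in the thread above):

```python
SYSIMAGE = "$HOME/.julia_sysimage_diffeqtorch.so"

SLURM_SETUP = [
    "module purge",
    # Double quotes so the shell expands $HOME on the worker.
    f'export JULIA_SYSIMAGE_DIFFEQTORCH="{SYSIMAGE}"',
    # Rebuild only if the sysimage is actually missing on this worker.
    f'[ -f "{SYSIMAGE}" ] || '
    "python -c 'from diffeqtorch.install import install_and_test; install_and_test()'",
]
```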

> If I understood correctly, you are loading precomputed samples from the reference posterior here, and the runtime as defined here corresponds to the runtime of the inference algorithm (as computed here).

> However, I was talking about the runtime of sampling the reference posterior itself. I need to sample from the reference posterior several times for empirical results, which means I cannot just load your precomputed samples. Do you have a solution for faster reference posterior sampling?

So, yes, for all tasks, 10 observations with 10k reference posterior samples each are stored in this repo. We ran algorithms on those observations and used the corresponding reference posterior samples to compute metrics, so that we only had to compute references once.
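For completeness, pulling those stored references out is short; a sketch, assuming the accessor names as I remember them from the README (`get_task`, `get_reference_posterior_samples`):

```python
def load_reference_samples(task_name: str, num_observation: int):
    """Load the 10k precomputed reference posterior samples that ship with
    sbibm for one of the 10 stored observations of a task."""
    import sbibm  # imported lazily so the sketch can be inspected offline

    task = sbibm.get_task(task_name)  # e.g. "lotka_volterra"
    return task.get_reference_posterior_samples(num_observation=num_observation)
```

`load_reference_samples("lotka_volterra", 1)` should then return the 10k samples for the first observation.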

In case you don't want to reuse the references, e.g., because you want to run on additional or different observations, you'll need to generate new ones. Unfortunately, this is slow. There is an option to parallelize across observations.
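Since the observations are independent, the fan-out is simple. A minimal local sketch, with a hypothetical `generate_reference` stand-in for whatever sampling call you use (on slurm, the same pattern maps onto one job per observation, e.g. via submitit's `map_array`):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_OBSERVATIONS = 10  # sbibm ships 10 observations per task

def generate_reference(num_observation: int) -> int:
    """Hypothetical stand-in: replace the body with the actual
    reference-posterior sampling for this observation."""
    return num_observation  # placeholder so the sketch is runnable

# One independent worker per observation. For the real CPU-bound sampling
# you would use processes, or separate slurm jobs, rather than threads.
with ThreadPoolExecutor(max_workers=NUM_OBSERVATIONS) as pool:
    results = list(pool.map(generate_reference, range(1, NUM_OBSERVATIONS + 1)))
```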

JuliaLinhart commented 1 year ago

> Hm, if the image is on the worker and gets detected properly, a build shouldn't be needed. Unfortunately, I'm not sure what's going wrong here.

Don't worry about that. I'll find a solution :)

> In case you don't want to reuse the references, e.g., because you want to run on additional or different observations, you'll need to generate new ones. Unfortunately, this is slow. There is an option to parallelize across observations.

Yes exactly, that's what I thought! Thanks for the parallelization option, that's already a big help!

jan-matthis commented 1 year ago

Glad it's helpful! Closing this issue -- feel free to reopen if you have follow-up questions.