lukas-weber / Carlo.jl

Monte Carlo framework that provides MPI parallelization, checkpointing and statistical postprocessing in an algorithm-agnostic way.
MIT License

Feature Request: MPI support for sequential tasks #11

Closed hz-xiaxz closed 1 month ago

hz-xiaxz commented 1 month ago

Hello! Thanks for your robust and nicely documented package! In Variational Monte Carlo one needs to perform the Monte Carlo calculations in sequence, because the variational parameters of each task depend on the MC results of the previous one. To achieve this, I hacked the job file as follows:

for _ in 1:SRsteps
    tm = TaskMaker()
    # set tm parameters here
    tm.g = g
    task(tm)

    dir = @__DIR__
    savepath = dir * "/../data/" * process_time *
               "/$(tm.nx)x$(tm.ny)g=$(tm.g)"
    job = JobInfo(
        savepath,
        FastFermionSampling.MC;
        tasks = make_tasks(tm),
        checkpoint_time = "30:00",
        run_time = "24:00:00"
    )

    with_logger(Carlo.default_logger()) do
        start(Carlo.SingleScheduler, job)
        # start(Carlo.MPIScheduler, job)
    end 

    update_g()
end

It runs fine with SingleScheduler, but with MPIScheduler it throws an error like running in parallel run mode but measure(::MC, ::MCContext, ::MPI.Comm) not implemented. I wonder whether MPI is simply not supported for this kind of job script, or whether it can be made to work by configuring the MPI.Comm? I still launch the job with mpirun -n 96 julia ./job.jl.

Going a bit further, would it be more elegant to add this feature to the JobTools module, e.g. letting tasks = make_tasks(tm) support both parallel and sequential modes? I'm not sure how easy that would be.

lukas-weber commented 1 month ago

Hi,

Basically, this should work. What you have run into with that error, for some reason, is parallel run mode. Usually that should not happen if you use MPIScheduler without setting the ranks_per_run option; I'll investigate later.
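
Parallel run mode is normally something you opt into via the ranks_per_run option. As a rough sketch (the exact keyword placement below is an assumption, so check the JobInfo docs), it would look something like this in your job file:

# Sketch only: the placement of ranks_per_run is assumed, check the JobInfo docs.
# ranks_per_run > 1 (or :all) spreads each run over several MPI ranks (parallel
# run mode); leaving it at its default keeps the ordinary one-rank-per-run behavior.
job = JobInfo(
    savepath,
    FastFermionSampling.MC;
    tasks = make_tasks(tm),
    checkpoint_time = "30:00",
    run_time = "24:00:00",
    ranks_per_run = 1,
)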

That said, parallel run mode would let you do something similar: it allows you to MPI-parallelize your own section of the code, which is why it asks you to implement versions of measure and the other methods that take a communicator. Maybe the most elegant way to do this in the first place is to use parallel run mode and move the whole update loop into your Monte Carlo code rather than the job file:

# In parallel run mode, sweep! receives an MPI communicator, so the parameter
# update loop can live inside the Monte Carlo code itself. The helper functions
# below are placeholders for your own code.
function Carlo.sweep!(mc::MC, ctx::MCContext, comm::MPI.Comm)
    sample_gradients_in_parallel!(mc, ctx, comm) # your MPI-parallel sampling
    if time_to_update_parameters()
        update_parameters!(mc, comm) # exchange gradients, sync parameters
    end
end

The downside is that you have to write some MPI code yourself, but the result will be faster because you don’t have to write everything to disk every time.

hz-xiaxz commented 1 month ago

Thanks for your prompt reply! I'll check out the parallel run mode. One more question: do you mean that I should wrap updateConfiguration (the old sweep!(mc, ctx) function) and the measure! functions into the new sweep!(mc, ctx, comm), and that the new sweep!(mc, ctx, comm) will then only update the parameters during a sweep?

lukas-weber commented 1 month ago

The new sweep function has to do everything the old one does, plus some manual communication to exchange the gradient data before updating the parameters and syncing them across the workers. (You can probably get away with an MPI.gather and an MPI.bcast.)

Instead of using the Carlo measurements for accumulating the gradients, you would have to average them manually.
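
As a rough sketch (mc.gradient, mc.parameters, and the step size η below are placeholders for whatever your VMC code uses, not Carlo.jl API), the update could look something like this:

using MPI

# Sketch only: gather the per-worker gradients, average them manually, apply a
# placeholder update rule on rank 0, then broadcast the result so all workers
# stay in sync.
function update_parameters!(mc::MC, comm::MPI.Comm)
    η = 0.05                                        # placeholder step size
    grads = MPI.gather(mc.gradient, comm; root = 0) # Vector of gradients on rank 0
    if MPI.Comm_rank(comm) == 0
        mean_grad = sum(grads) / length(grads)      # manual averaging
        mc.parameters .-= η .* mean_grad            # placeholder update rule
    end
    # sync the updated parameters back to every worker
    mc.parameters .= MPI.bcast(mc.parameters, comm; root = 0)
    return nothing
end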

hz-xiaxz commented 1 month ago

Got it, thanks for your help! I think this issue can be closed.