mesh-adaptation / animate

Anisotropic mesh adaptation toolkit for Firedrake
MIT License

Option for adapting in serial #72

Closed: jwallwork23 closed this 1 month ago

jwallwork23 commented 5 months ago

Closes #71.

jwallwork23 commented 5 months ago

Good spot, thanks.

ddundo commented 4 months ago

One worrying thing is that checkpointing is very very slow. For example, I ran the burgers demo in serial with two subintervals and checkpointed the final solution of each subinterval - so I'd say it's representative of checkpointing the mesh and the metric at each subinterval, as in this PR. I am attaching the flame graph here burgers2.zip (zipped up since github makes the svg uninteractive).


So it looks like checkpointing is more than twice as expensive as solve_forward. And similarly for my glacier experiment where I have 100 subintervals (each with 50 timesteps). I also ran the burgers demo with a single subinterval in both serial and parallel (possible since it only has 1 subinterval so there are no mesh-to-mesh projections happening). There is no speed-up in checkpointing, but there is in solve_forward.
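For what it's worth, a stand-alone way to time just the checkpoint write in isolation is something like the following (illustrative mesh and function, not the actual burgers setup from the demo):

```python
# Time a CheckpointFile write on its own, with a stand-in mesh and function.
import time

from firedrake import CheckpointFile, Function, FunctionSpace, UnitSquareMesh

mesh = UnitSquareMesh(100, 100)
f = Function(FunctionSpace(mesh, "CG", 1), name="solution")
f.assign(1.0)

start = time.perf_counter()
with CheckpointFile("timing_test.h5", "w") as afile:
    afile.save_mesh(mesh)
    afile.save_function(f)
print(f"Checkpoint write took {time.perf_counter() - start:.2f} s")
```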

So if we have to checkpoint at every iteration (we'd have to checkpoint twice in the first iteration and once in later ones), this might cancel out a lot of the benefits of solving in parallel.

Have you maybe tried this @jwallwork23 or @stephankramer? Any chance there is an alternative?

jwallwork23 commented 4 months ago

To be honest, I haven't done these kinds of investigations yet so this is useful information, thanks @ddundo.

jwallwork23 commented 4 months ago

@ddundo This is almost there actually. I need to tidy things up and fix one issue: stashing the metric parameters in the CheckpointFile. Is this something you've tried before by any chance?

ddundo commented 4 months ago

Thanks @jwallwork23! Yup, here's an example of how I did it - messy but it works :)

I now found a better way:

```python
import numpy as np
from firedrake import CheckpointFile

mp = {
    "dm_plex_metric_target_complexity": 1234.,
    "dm_plex_metric_h_min": 1.,
    "dm_plex_metric_h_max": 50e3,
    "dm_plex_metric_p": np.inf,
    "dm_plex_metric_a_max": 1e5,
}

with CheckpointFile('test.h5', 'w') as afile:
    afile._write_pickled_dict("metric_parameters", "mp_dict", mp)

# then load like this
with CheckpointFile('test.h5', 'r') as afile:
    loaded_mp = afile._read_pickled_dict("metric_parameters", "mp_dict")

print(loaded_mp)  # should be the same as mp
```

jwallwork23 commented 4 months ago

This works locally but, annoyingly, the GitHub Actions run gets stuck on the first test that uses subprocess. I wonder if the CI has a problem with it.

stephankramer commented 4 months ago

Out of interest why do we need subprocess to do this? i.e. why don't we just do the whole "read checkpoint, adapt, write checkpoint" in the same process on rank 0?

ddundo commented 4 months ago

> Out of interest why do we need subprocess to do this? i.e. why don't we just do the whole "read checkpoint, adapt, write checkpoint" in the same process on rank 0?

Joe mentioned subprocess before I commented this, but I've now gone to quantify it and realised I was loading things wrongly in the case Stephan is referring to. The two approaches actually took essentially the same time (1 s difference). Sorry about this!

- In the subprocess method, `adapt_script.py` looks like this:

```python
with CheckpointFile('save_old.h5', 'r') as afile:
    loaded_mesh = afile.load_mesh("firedrake_default")
    loaded_metric = afile.load_function(loaded_mesh, "metric")
    loaded_mp = afile._read_pickled_dict("metric_parameters", "mp_dict")

rmetric = RiemannianMetric(loaded_metric.function_space()).assign(loaded_metric)
rmetric.set_parameters(loaded_mp)
adapted_mesh = adapt(loaded_mesh, rmetric)

with CheckpointFile('save_new.h5', 'w') as afile:
    afile.save_mesh(adapted_mesh)
```

- But in Stephan's method we have to pass `COMM_SELF` to several places:

```python
if rank == 0:
    with CheckpointFile('save_old.h5', 'r', comm=COMM_SELF) as afile:
        # load - same as above

    # metric... - same as above
    adapted_mesh = adapt(loaded_mesh, rmetric, comm=COMM_SELF)  # this is new

    with CheckpointFile('save_new.h5', 'w', comm=COMM_SELF) as bfile:
        bfile.save_mesh(adapted_mesh)
COMM_WORLD.barrier()
```

So we'd have to modify `animate.adapt.adapt` to accept a `comm` argument and pass it through to `firedrake.mesh.Mesh`, exactly as the `name` argument is passed now. Not hard to do, and probably a good idea in general. Maybe we could allow `animate.adapt.adapt` to take kwargs to be passed on to `firedrake.mesh.Mesh`.

ddundo commented 4 months ago

@jwallwork23 if you'd rather not use private methods, we can also do

```python
# metric and mp as defined in the earlier example
with CheckpointFile('test.h5', 'w') as afile:
    afile.save_function(metric, name="metric")
    for key, value in mp.items():
        afile.h5pyfile.create_dataset(f'metric_parameters/{key}', data=value)
```

(Essentially the same as the first way I suggested and then edited, which used h5py directly, but all in one go.)
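For completeness, reading those datasets back could then look something like this (a sketch following the group/dataset names used above; not from the thread):

```python
# Rebuild the metric parameter dictionary from the "metric_parameters" group.
from firedrake import CheckpointFile

with CheckpointFile('test.h5', 'r') as afile:
    group = afile.h5pyfile["metric_parameters"]
    loaded_mp = {key: group[key][()] for key in group.keys()}

print(loaded_mp)  # should match mp
```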

jwallwork23 commented 4 months ago

Thanks both! That works locally and simplifies things a lot! Will tidy things up. Downside is that the CI still seems to hang, will need to investigate.

jwallwork23 commented 4 months ago

Now tidied up and fixed some oversights in the checkpointing. Ready for review, although the CI is still hanging as soon as it hits a test with the new functionality. Any thoughts?

ddundo commented 4 months ago

Regarding my flamegraph above... you made me realise what was wrong @jwallwork23! I am doing stuff on a server where I have the firedrake installation and my data on separate drives, and they apparently have very different writing speeds. So when I checkpoint into the drive with the firedrake installation (as your PR here does), it takes less than 1 second, and on the other drive it takes over 30 seconds.

So regarding my review comment https://github.com/pyroteus/animate/pull/72#discussion_r1517681102, I thought that keeping checkpoints would be useful since then we'd avoid repeatedly saving the same mesh. This is still true I guess.

So you might want to check how long checkpointing takes for you just in case, but I assume that it will not be slow! So ignore my comment above :) thanks for this!

jwallwork23 commented 3 months ago

@ddundo Thanks for these changes, although the tests no longer pass locally for me. Do they work for you? If not, please could you fix them?

ddundo commented 3 months ago

> @ddundo Thanks for these changes, although the tests no longer pass locally for me. Do they work for you? If not, please could you fix them?

Sorry, thought you'd review the changes first :) Done in dabb9f68fcb66f72b761c10444410799d77a2268. I added a new unit test for custom mesh names which required me to slightly modify other tests too.

jwallwork23 commented 3 months ago

I'm afraid tests still fail locally for me.

ddundo commented 3 months ago

Sorry, I ignored the parallel tests! They should also work now :)

I'm not sure if it's relevant here, but maybe it's insight into the CI getting stuck... In L214 I tried doing `newmesh = chk.load_mesh(adaptor0.adapted_mesh.name)` and pytest got stuck for me locally on the same test as the workflow here. This makes sense, since `adaptor0` is defined inside the `rank == 0` block and I was referring to it outside of it.
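For reference, a sketch of the kind of fix (an assumption, not the actual test code): broadcast the adapted mesh's name from rank 0 so that every rank passes the same name to the collective `load_mesh` call.

```python
# Sketch only: adaptor0 exists on rank 0 only, so its mesh name must be
# broadcast before all ranks call load_mesh collectively.
from firedrake import CheckpointFile
from mpi4py import MPI

comm = MPI.COMM_WORLD
mesh_name = adaptor0.adapted_mesh.name if comm.rank == 0 else None
mesh_name = comm.bcast(mesh_name, root=0)

with CheckpointFile("save_new.h5", "r") as chk:  # illustrative filename
    newmesh = chk.load_mesh(mesh_name)  # collective: same name on every rank
```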

jwallwork23 commented 3 months ago

Thanks @ddundo.

Okay my recent debugging commits show that the rank 0 if condition works fine... but then it never leaves the if block. This isn't something I've seen before - let's discuss in the meeting.

ddundo commented 3 months ago

@jwallwork23 this sounds like an issue I had when doing this with subprocess. In the subprocess I'd write things into a text file and it got stuck in the rank 0 if condition because I forgot to close the file. I checked locally and it seems like `chk` does indeed stay open despite the `with` statement (I tried all the checks here https://stackoverflow.com/questions/29863342/close-an-open-h5py-data-file and they all show that `chk` stays open).
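(For reference, those open-handle checks in a standalone form with plain h5py; an illustration, not the CheckpointFile internals:)

```python
# Check whether an h5py file handle is still open.
import h5py

f = h5py.File("open_check.h5", "w")
print(bool(f), f.id.valid)  # True True while the file is open
f.close()
print(bool(f), f.id.valid)  # False False after closing
```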

Maybe we could try doing

```python
chk = fchk.CheckpointFile(output_fname, "w", comm=COMM_SELF)
chk.save_mesh(adaptor0.adapted_mesh)
chk.close()
```

and perhaps even `del chk, metric0, adaptor0` before leaving the `if rank == 0` block.

I tried doing it myself by forking animate but it's extremely temperamental (https://github.com/ddundo/animate/pull/1)... even just adding a comment causes it not to print the debug messages you put in.

ddundo commented 3 months ago

@jwallwork23 I got it to work! Please see the successful commit at https://github.com/ddundo/animate/pull/1. I added `time.sleep(30)` for the rank 1 process.

jwallwork23 commented 3 months ago

> @jwallwork23 I got it to work! Please see the successful commit at ddundo#1. I added `time.sleep(30)` for the rank 1 process.

Nice work! Although this is bizarre...

ddundo commented 3 months ago

> Nice work! Although this is bizarre...

My theory is that the CI doesn't like having dead processes so we need to have the rank 1 process do something. Which I guess makes sense!

jwallwork23 commented 3 months ago

> My theory is that the CI doesn't like having dead processes so we need to have the rank 1 process do something. Which I guess makes sense!

So https://github.com/pyroteus/animate/commit/b97ba84d5aca8ebb81eb15bdfe9274f67b89a23d worked. However, it seems very temperamental, because https://github.com/pyroteus/animate/pull/72/commits/47ab149fdb42444260ace14037b2484d9921db1a is identical but it hangs...

ddundo commented 3 months ago

Okay, I'm pretty sure I finally fixed it! When it seems to randomly hang, it's because an error was raised on one process. And this doesn't seem to kill the other ranks if there is a barrier afterwards. For example, try running this MFE with and without the barrier:

```python
from mpi4py import MPI

if MPI.COMM_WORLD.rank == 0:
    raise ValueError("Test")
MPI.COMM_WORLD.barrier()  # it hangs forever if there is a barrier
```

I was googling and found only super complicated ways to broadcast an error from the process on which it was raised to all others. Do you or @stephankramer know of a simple way to do it? Otherwise, I think we should just get rid of all barriers, as I did here.

1. First barrier: when it got stuck without printing your debug messages, it turned out that it was stuck here:

   ```python
   checkpoint_dir = get_checkpoint_dir()
   if not os.path.exists(checkpoint_dir) and COMM_WORLD.rank == 0:
       os.makedirs(checkpoint_dir)
   COMM_WORLD.barrier()
   ```

   This is probably because of the FileExistsError I mentioned in https://github.com/pyroteus/animate/pull/72#discussion_r1517671058. I don't know if you saw my comment afterwards on the same thread, but `checkpoint_dir` already gets created in `get_checkpoint_dir`, which means the rank 0 process would sometimes get a `FileExistsError` here and then hang forever because of the barrier afterwards. So I got rid of this block completely and, inside `get_checkpoint_dir`, changed `os.makedirs(checkpoint_dir)` to `os.makedirs(checkpoint_dir, exist_ok=True)`. This is why it was "temperamental".

2. Second barrier: I replaced the barrier with something else that acts like a barrier, using `time.sleep` and `COMM_WORLD.bcast` (see the sketch after this list).
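For reference, the sleep-and-bcast replacement looks roughly like this (an assumed shape, not the exact code in the PR):

```python
# Rank 0 does the adapt/checkpoint work; the other ranks sleep briefly and then
# block on the broadcast, which acts like a one-way barrier.
import time

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.rank == 0:
    # ... checkpoint and adapt on rank 0 only ...
    success = True
else:
    time.sleep(1)  # keep the idle ranks busy rather than sitting in a barrier
    success = None
success = comm.bcast(success, root=0)  # every rank waits here for rank 0
```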

jwallwork23 commented 3 months ago

Nice work @ddundo! Thanks very much for this. I added a minor refactoring to avoid sleeping unnecessarily long.

This is now finally ready for review. Would you mind taking a look @stephankramer?

stephankramer commented 3 months ago

Ah sorry, I thought I had already left some comments - but these were still "pending". I think my point re. the checkpoint file naming may be relevant for the discussion above - I don't know if we are running any tests concurrently?

Re: the barrier. I'm afraid it's not really a solution to just not have barriers. The issue in the MFE you gave is that the error you raise does not call MPI.Abort - thus the process exits quietly, leaving the other one hanging (correctly). This is prevented in firedrake world by adding an exception hook (see PyOP2/mpi.py). In general, it's just a fact of life that, in parallel, you won't always get complete output and you will get hangs when an error occurs.
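For reference, the kind of exception hook being referred to looks roughly like this (an illustration of the idea, not the actual PyOP2 code):

```python
# Any uncaught exception triggers an MPI abort, so no rank is left hanging on a
# collective call waiting for the failed process.
import sys

from mpi4py import MPI


def mpi_excepthook(exc_type, exc_value, tb):
    sys.__excepthook__(exc_type, exc_value, tb)  # print the usual traceback first
    MPI.COMM_WORLD.Abort(1)  # then bring down every rank


sys.excepthook = mpi_excepthook
```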

I don't think calling `makedirs` with `exist_ok=True` is safe - you may prevent an error, but you're still creating a race condition, with multiple processes trying to create the same directory at the same time. The thing is, under the hood the creation of a directory still consists of several steps that each process goes through, and it is entirely possible that at the point one process checks whether the path already exists, another process has already started creating that same path without having made any detectable changes yet. E.g. this is not safe in parallel:

```python
if not os.path.exists(PATH):
    os.makedirs(PATH)
```

because multiple processes may pass the test simultaneously.
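A minimal sketch of the alternative pattern (illustrative names, not the PR's actual code): only one rank creates the directory, and a barrier guarantees it exists before any rank uses it.

```python
# Rank 0 alone creates the directory; the barrier ensures it exists before any
# rank starts writing into it.
import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
checkpoint_dir = "/tmp/animate_checkpoints"  # illustrative path

if comm.rank == 0:
    # exist_ok only guards against a directory left over from a previous run;
    # since a single rank calls makedirs, there is no race between processes.
    os.makedirs(checkpoint_dir, exist_ok=True)
comm.barrier()
```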

Similarly replacing a barrier with sleep+bcast feels like voodoo programming to me (a common issue with people trying to debug MPI issues). The point of barriers is actually to make a parallel program more predictable. So if it doesn't work with a barrier there's probably something wrong...

ddundo commented 3 months ago

Thanks @stephankramer! I should say re. the second barrier I removed: I haven't run into errors there so leaving the second barrier works. I just removed it "in case" an error would be raised for whatever reason - but these are harder to predict than the FileExistsError in the first barrier (e.g. checkpointing the adapted mesh might theoretically fail because the user has no space left, etc.).

ddundo commented 3 months ago

@jwallwork23 and @stephankramer, how about then adding a try and except block? This would allow us to keep the barrier and still kill the program:

```python
if COMM_WORLD.rank == 0:
    try:
        ...  # checkpoint and adapt
    except Exception as e:
        PETSc.Sys.Print(f"{type(e).__name__}: {e}", comm=COMM_SELF)
        COMM_WORLD.Abort()
COMM_WORLD.barrier()
```

stephankramer commented 3 months ago

So to be clear, and also make sure I'm not misunderstanding things.

So I thought the issue you described was caused by creating the directory in two different places which is now fixed.

Then the reason this exception caused a hang is still unclear. Whenever we're in animate we have already imported firedrake, which in turn imports pyop2, which should have added the exception hook that causes it to call MPI Abort. Maybe this doesn't happen because the exception is caught by pytest first. Actually let me test that.....

Yes, if you make the following test:


```python
import pytest
import firedrake
from mpi4py import MPI


@pytest.mark.parallel(nprocs=2)
def test_mine():
    if MPI.COMM_WORLD.rank == 0:
        raise ValueError("Test")
    MPI.COMM_WORLD.barrier()  # it hangs forever if there is a barrier
```
and run it under pytest, it hangs - whereas if you just run it with mpiexec and call test_mine() it doesn't (because of the firedrake import). So this is something we should probably report to pytest-mpi rather than trying to work around it in a specific place. Note that this might happen anywhere when running in parallel, not just in the specific place where we fork off a separate code path on rank 0. Note also that this has nothing to do with using barriers or not: any time you run in parallel and one of the processes silently exits (without calling abort), you will get a hang, because somewhere further on another process will be waiting for a message that the exiting process was meant to have sent.

In any case, we need a barrier right after we create the directory: we really should only make the makedirs call on one rank, but all processes need to write to the directory initially (before rank 0 then reads it back in), so before they write we need a guarantee that the directory has actually been created.

ddundo commented 3 months ago

Ah that's interesting! Thanks for explaining and testing this!

jwallwork23 commented 3 months ago

```python
from mpi4py import MPI
import pytest


@pytest.mark.parallel(nprocs=2)
def test_mine():
    if MPI.COMM_WORLD.rank == 0:
        raise ValueError("Test")
    MPI.COMM_WORLD.barrier()  # it hangs forever if there is a barrier
```

Thanks very much for looking into this in detail @stephankramer. So just to be clear, your recommendation is to post this as an issue on the pytest-mpi repo?

ddundo commented 2 months ago

This hangs again... I'll investigate

jwallwork23 commented 2 months ago

> This hangs again... I'll investigate

Alas... but thanks!

ddundo commented 2 months ago

Okay this is ready now! Sorry for pushing to your branch - it was a minor fix :)

But it was still annoying to debug since it hangs and we don't get an error message. So I'd be keen to report this to pytest-mpi to avoid issues in the future, but I think it would be better if you or @stephankramer did it :)

Edit: I'll approve it, since you can't review your own PR @jwallwork23, but I won't merge it.

jwallwork23 commented 2 months ago

Okay cool, thanks @ddundo. Will leave it to @stephankramer to re-review then.

ddundo commented 2 months ago

Thanks for reviewing @stephankramer!

@jwallwork23 could you please take a look and see if you agree, and then I can make the final changes?

jwallwork23 commented 1 month ago

> Thanks for reviewing @stephankramer!
>
> @jwallwork23 could you please take a look and see if you agree, and then I can make the final changes?

That'd be great, thanks @ddundo.

ddundo commented 1 month ago

@jwallwork23 I addressed everything now and added a tearDownClass method in test_checkpoint.py to delete all temporary directories created during testing there. I'm happy to merge if you are :)

jwallwork23 commented 1 month ago

> @jwallwork23 I addressed everything now and added a tearDownClass method in test_checkpoint.py to delete all temporary directories created during testing there. I'm happy to merge if you are :)

Excellent! Merging now.