pangeo-data / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

Calling regridder causes mpirun to fail #267

Closed ashjbarnes closed 9 months ago

ashjbarnes commented 1 year ago

I'm writing a pipeline that needs to do some 'small' regridding tasks and one 'big' one. I call xesmf.Regridder() for the small tasks, then use subprocess.run('mpirun ESMF_RegridWeightGen...') for the big one.

However, once the xesmf regridder has run, subsequent subprocess calls fail, returning:

CompletedProcess(args='mpirun ESMF_RegridWeightGen -s bathy_original.nc -d topog_raw.nc -w weights/bathyweights.nc -m bilinear --src_regional --dst_regional', returncode=1)

To reproduce, run this code, then comment out the xe.Regridder() call (mpirun should then work) and run again.

import subprocess
import xesmf as xe
import numpy as np
import xarray as xr

X,Y = np.meshgrid(np.linspace(0,100,70),np.linspace(0,100,70))

a = xr.Dataset(
    {"lon":(["lon"],np.linspace(0,50,50)),
    "lat":(["lat"],np.linspace(0,50,50))
})

b = xr.Dataset(
    {"data":(["lon","lat"],X),
     "lon":(["lon"],np.linspace(0,50,70)),
     "lat":(["lat"],np.linspace(0,50,70)),
     }
)

### If you don't call xe.Regridder() then the subprocess call works
regridder = xe.Regridder(
    b.data, a, "bilinear",
)

## If you've called the regridder, the below simply returns error code 1 with no other message. It works if you remove 'mpirun', so presumably the regridder is messing up mpirun somehow?
subprocess.run(
    "mpirun ESMF_RegridWeightGen -s IN.nc -d OUT.nc -w weights/bathyweights.nc -m bilinear",shell = True
)

angus-g commented 1 year ago

This is happening because the ESMF being used from xesmf is configured with MPI support. When the regridder is called, MPI is initialised within the context of the Python process. OpenMPI doesn't support recursively running MPI, so it aborts immediately (related: https://github.com/open-mpi/ompi/issues/9729).

I think the RegridWeightGen step needs to be performed either before, or external to the Python script which performs the regridding (see https://xesmf.readthedocs.io/en/latest/large_problems_on_HPC.html for suggestions).
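A minimal sketch of the "before" ordering, assuming the weight generation can run first in the pipeline. The helper names (`run_external_step`, `build_regridder`) are hypothetical and not part of xESMF. Deferring the `xesmf` import is only a precaution so that nothing MPI-related is loaded in the parent process before the external command has finished; I haven't verified whether the import alone would trigger the problem.

```python
import subprocess
import sys


def run_external_step(cmd):
    """Run an external command (e.g. the mpirun step) and return its exit code.

    `cmd` is a list of arguments, so no shell is involved.
    """
    completed = subprocess.run(cmd)
    return completed.returncode


def build_regridder(ds_in, ds_out, method="bilinear"):
    """Build an xESMF regridder, importing xesmf only at this point.

    Hypothetical helper: call it only after every mpirun step is done,
    since calling the regridder initialises MPI in this Python process.
    """
    import xesmf as xe  # deferred on purpose (precaution, see lead-in)

    return xe.Regridder(ds_in, ds_out, method)


# Stand-in for the real command so the sketch runs anywhere. In the
# pipeline from this issue it would be something like:
#   ["mpirun", "ESMF_RegridWeightGen", "-s", "IN.nc", "-d", "OUT.nc",
#    "-w", "weights/bathyweights.nc", "-m", "bilinear"]
exit_code = run_external_step([sys.executable, "-c", "pass"])
if exit_code != 0:
    raise RuntimeError(f"external step failed with exit code {exit_code}")

# Only now would the small regridding tasks run, e.g.:
#   regridder = build_regridder(b["data"], a)
```

If the big task can't come first, the other option mentioned above is to keep it external altogether, for example by running the mpirun step from a separate script or job step rather than from the process that uses xesmf.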

ashjbarnes commented 1 year ago

Thanks @angus-g!

Yeah, that's a good fix for now. I wonder if there's a way to get your kernel to purge MPI after a regridder call? Calling orte-clean doesn't seem to do anything, and you can't kill orted without killing the whole kernel.

Or, is it possible to run xesmf without mpirun at all for the smaller tasks?

huard commented 10 months ago

Hi,

Unless there's a proposal for a specific change to xESMF, I'd close this. Thoughts?

angus-g commented 9 months ago

I think it's probably fine to close this. At most, a note about why this occurs could go somewhere in the docs, but maybe people will stumble on this thread anyway!