pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Memory spike for lazily accessing netCDF3 file with scipy backend #361

Open rabernat opened 2 years ago

rabernat commented 2 years ago

I am trying to create a recipe with some netCDF3 files based on this example - https://discourse.pangeo.io/t/dask-xarray-and-swap-memory-polution-on-local-linux-cluster/2453

I have discovered an issue with the way netCDF3 / scipy / fsspec interact. The core problem is that scipy's netcdf_file has an mmap option that provides lazy access, but it only works with local files. In every other case we appear to be loading the data eagerly.

I profiled the following cases using fil (the Fil memory profiler).
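(A minimal setup sketch for the cells below, assuming a Jupyter session; fil is distributed as the filprofiler package, whose IPython extension provides the %%filprofile cell magic used here.)

# hedged setup sketch, not part of the original notebook
# pip install filprofiler
import fsspec
import xarray as xr
%load_ext filprofiler  # enables the %%filprofile cell magic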

Local file

%%filprofile
url = "gcs://leap-scratch/rabernat/ERA5_HiRes_Hourly/cache/de66f7c4c230a196f5fad34f35355df2-https_cluster.klima.uni-
with fsspec.open("simplecache::" + url, "rb") as fp:
    ds = xr.open_dataset(fp.name, engine='scipy', backend_kwargs={'mmap': True})
(fil memory profile screenshot)

fsspec local filesystem

%%filprofile
with fsspec.open("simplecache::" + url, "rb") as fp:
    ds = xr.open_dataset(fp, engine='scipy', backend_kwargs={'mmap': False})
(fil memory profile screenshot)

Using gcsfs

%%filprofile
with fsspec.open(url, "rb") as fp:
    ds = xr.open_dataset(fp, engine='scipy', backend_kwargs={'mmap': False})
(fil memory profile screenshot)
martindurant commented 2 years ago

Not sure if you are aware of the nascent code to kerchunk netCDF3 files: https://github.com/fsspec/kerchunk/pull/131 . This just needs some time and effort to finish; it's not too complicated.
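For orientation, a hedged sketch of what the kerchunk route could look like once that PR lands. The kerchunk.netCDF3 module/class names and keywords below are assumptions modelled on kerchunk's existing HDF5 scanner, not a confirmed API.

# hedged sketch; import path, class name, and keywords are assumptions
import xarray as xr
from kerchunk.netCDF3 import NetCDF3ToZarr  # assumed import path

refs = NetCDF3ToZarr(url).translate()  # url as in the profiling cells above
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "gcs"},
    },
)

With a reference filesystem like this, only the byte ranges recorded in refs are fetched on demand, so the eager scipy read path is bypassed entirely.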

rabernat commented 2 years ago

@martindurant thanks for the pointer. Yes, kerchunk netCDF3 support would be great, because then we could use open_input_with_kerchunk here!
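For concreteness, a hedged sketch of what that might look like in a recipe. The file pattern below is hypothetical, and open_input_with_kerchunk is assumed to be (or become) an XarrayZarrRecipe option once kerchunk grows netCDF3 support.

# hedged sketch: hypothetical file pattern; open_input_with_kerchunk assumed
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

months = pd.date_range("2000-01-01", "2000-12-01", freq="MS")

def make_url(time):
    # placeholder URL template, not the real ERA5 source
    return f"https://example.org/era5_hires_hourly_tp_{time:%Y_%m}.nc"

pattern = FilePattern(make_url, ConcatDim("time", months))
recipe = XarrayZarrRecipe(pattern, open_input_with_kerchunk=True)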

However, it seems like it would also be good to find an upstream solution to this, as it affects xarray users broadly.

rabernat commented 2 years ago

Just to confirm that this is not an xarray issue, here is the same profile calling scipy's netcdf_file directly:

%%filprofile
from scipy.io import netcdf_file
with fsspec.open(url, "rb") as fp:
    ncf = netcdf_file(fp, mmap=False)
(fil memory profile screenshot)
martindurant commented 2 years ago

The kerchunk version subclasses from scipy to avoid the read (just reading my own code now...)

cisaacstern commented 2 years ago

@jordanplanders and I recently discovered that https://github.com/pangeo-forge/staged-recipes/pull/176#issuecomment-1248826286 is blocked by this. Also because I don't see it directly tagged elsewhere here, https://github.com/pangeo-forge/staged-recipes/pull/140#issuecomment-1196887935 is blocked by this as well.

Looks like https://github.com/pangeo-forge/pangeo-forge-recipes/pull/383 is a near-complete solution. TBH, even if that PR were to go in, upgrading Pangeo Forge Cloud with new pangeo-forge-recipes + upstream dependency versions remains a somewhat toilsome process, requiring PRs to two separate repositories.

I am optimistic that https://github.com/pangeo-forge/pangeo-forge-runner/issues/27 presents a path forward for dramatically 📉 reducing toil and 📈 increasing flexibility of incorporating bleeding-edge releases in Pangeo Forge Cloud.

martindurant commented 2 years ago

netCDF3 may well work fairly well with kerchunk; I'm not sure what the linked PR is blocked by.

I have not put a huge amount of effort into the kerchunk implementation, so improvements are very likely easy to reach. In particular, large single arrays will still be accessed as a whole, but since they will have been written this way too, that's not likely to be a big memory problem.
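A quick, hedged way to check how a given file ends up chunked, assuming a version-1 reference dict like the refs from the sketch earlier in this thread (each data entry maps a zarr chunk key to a url/offset/length triple):

# count data-chunk references per variable; a variable with a single entry
# will be fetched as one whole block
from collections import Counter
counts = Counter(
    key.split("/")[0]
    for key in refs["refs"]
    if "/" in key and not key.split("/")[1].startswith(".")
)
print(counts)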

jordanplanders commented 2 years ago

@cisaacstern Can you speak to the specifics of the memory issue we were up against, in this context?

cisaacstern commented 2 years ago

@jordanplanders I think there's a reasonable chance that merging #383 would solve the CiTRACE issue.

> netCDF3 may well work fairly well with kerchunk; I'm not sure what the linked PR is blocked by.

Following the release of kerchunk 0.0.7, it's not clear to me that anything is blocking this. I'll follow up there.