Open rabernat opened 2 years ago
Not sure if you are aware of the nascent code to kerchunk netCDF3 files: https://github.com/fsspec/kerchunk/pull/131 . This just needs some time and effort to finish, it's not too complicated.
@martindurant thanks for the pointer. Yes, kerchunk netCDF3 support would be great, because then we could use open_input_with_kerchunk here!
However, it seems like it would also be good to find an upstream solution to this, as it affects xarray users broadly.
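Just to confirm that this is not an xarray issue, the same eager read happens with scipy alone:
%%filprofile
import fsspec
from scipy.io import netcdf_file
# url is the remote netCDF3 file being profiled
with fsspec.open(url, "rb") as fp:
    # mmap=False is the only option for a non-local file object
    ncf = netcdf_file(fp, mmap=False)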
The kerchunk version subclasses from scipy to avoid the read (just reading my own code now...)
@jordanplanders and I recently discovered that https://github.com/pangeo-forge/staged-recipes/pull/176#issuecomment-1248826286 is blocked by this. Also, since I don't see it directly tagged elsewhere here: https://github.com/pangeo-forge/staged-recipes/pull/140#issuecomment-1196887935 is blocked by this as well.
Looks like https://github.com/pangeo-forge/pangeo-forge-recipes/pull/383 is a near-complete solution. TBH, even if that PR were to go in, upgrading Pangeo Forge Cloud with new pangeo-forge-recipes + upstream dependency versions remains a somewhat toilsome process, requiring PRs to both:
I am optimistic that https://github.com/pangeo-forge/pangeo-forge-runner/issues/27 presents a path forward for dramatically 📉 reducing toil and 📈 increasing flexibility of incorporating bleeding-edge releases in Pangeo Forge Cloud.
netCDF3 may well work fairly well with kerchunk, not sure what the linked PR is blocked by.
I have not put a huge amount of effort into the kerchunk implementation, so improvements are very likely within easy reach. In particular, large single arrays will still be accessed as a whole, but since they will have been written this way too, that's not likely to be a big memory problem.
@cisaacstern Can you speak to the specifics of the memory issue we were up against, in this context?
@jordanplanders I think there's a reasonable chance that merging #383 would solve the CiTRACE issue.
> netCDF3 may well work fairly well with kerchunk, not sure what the linked PR is blocked by.
Following release of kerchunk 0.0.7, it's not clear to me that anything is blocking this. I'll follow up there.
I am trying to create a recipe with some netCDF3 files based on this example - https://discourse.pangeo.io/t/dask-xarray-and-swap-memory-polution-on-local-linux-cluster/2453
I have discovered an issue with the way netCDF3 / scipy / fsspec interact. The main issue is that the scipy netcdf function has an mmap option which provides lazy access, but it only works with local files. It seems that otherwise we are eagerly loading the data. I profiled the following cases using fil.
Local file
fsspec local filesystem
Using gcsfs
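For reference, the difference between opening by local path and opening through a file object can be sketched with scipy alone. This is a minimal illustration, not the profiled code: the file name, the variable name temp, and the array size are made up, and an in-memory io.BytesIO stands in for a remote fsspec file object (an assumption, chosen so the example runs without network access).

```python
import io
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

# Write a small netCDF3 file to a temporary location (illustrative data only).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.nc")
with netcdf_file(path, "w") as out:
    out.createDimension("x", 1000)
    var = out.createVariable("temp", "f4", ("x",))
    var[:] = np.arange(1000, dtype="f4")

# Case 1: open by local path with mmap=True. scipy memory-maps the file,
# so variable data is paged in by the OS on access instead of read up front.
lazy = netcdf_file(path, "r", mmap=True)
lazy_uses_mmap = lazy.use_mmap
lazy_value = float(lazy.variables["temp"][5])  # copy the scalar out before closing
lazy.close()

# Case 2: open through a file-like object with mmap=False, as is required
# for remote files. BytesIO is a stand-in for an fsspec file here; scipy
# reads each variable's bytes into memory when the file is parsed.
with open(path, "rb") as fp:
    buf = io.BytesIO(fp.read())
eager = netcdf_file(buf, "r", mmap=False)
eager_uses_mmap = eager.use_mmap
eager_value = float(eager.variables["temp"][5])
eager.close()

print(lazy_uses_mmap, eager_uses_mmap, lazy_value, eager_value)
```

Both paths return the same values; what differs is when the bytes are pulled into memory, which is what the fil profiles compare.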