
Efficient access to IOOS data in the cloud #14

Closed rsignell-usgs closed 1 year ago

rsignell-usgs commented 2 years ago


Project Description:

Want to spend your summer getting paid by Google to improve our ability to work with climate, weather, and remote sensing data in the Cloud? Come join us to work on Kerchunk, a Python package that turbocharges the world's most common scientific data formats, allowing efficient, parallel, cloud-native access!

Much of IOOS data, especially that from models, is in NetCDF format. Currently it's served primarily through ERDDAP and THREDDS, but in the best interests of open science, IOOS data could be made available on the Cloud, and services like ERDDAP and THREDDS layered on top if necessary.

While new cloud-performant formats like Zarr have been created to represent the NetCDF Data Model, it has been shown that NetCDF files themselves can be made cloud-performant by creating a JSON file that makes a collection of NetCDF files readable by the Zarr library.

There is a package that assists in the creation of these JSON files, called Kerchunk. It currently reads collections of NetCDF4 and GRIB2 files, but could be expanded to cover a wider array of formats, including NetCDF3, a format commonly used in IOOS.

The student will work with Kerchunk to expand its capabilities and to develop pipelines that convert massive collections of forecast and remote sensing data on the Cloud into virtual Zarr datasets that can be used efficiently and effectively in Python-based workflows, for example https://registry.opendata.aws/ecmwf-era5/.

See this Medium blog post for a description of this powerful unifying approach to handling scientific data in the Cloud: https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
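To make the approach concrete, here is a minimal sketch of the existing NetCDF4/HDF5 path in Kerchunk (the bucket path and file names are placeholders, not a real IOOS dataset): the file is scanned once to record where each chunk lives, the references are written to JSON, and xarray then reads the original file lazily through the Zarr engine.

```python
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/model_output_2022-01-01.nc"  # placeholder path

# Scan the NetCDF4/HDF5 file once, recording the byte range of every chunk
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

with open("single.json", "w") as out:
    json.dump(refs, out)

# The original file can now be read lazily as if it were a Zarr store
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "single.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
print(ds)
```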

Expected Outcomes:

The GSoC student would work with mentors to extend the current code base and to generate Jupyter notebooks, documentation, and blog posts that demonstrate the new capabilities added and the workflows generated. All work will be done on GitHub, and weekly virtual meetings will take place with mentors.

Skills required:

Python

Difficulty:

Moderate

Mentor(s):

@rsignell-usgs (research oceanographer, USGS)

@martindurant (professional open-source software developer, Anaconda, Inc)

martindurant commented 2 years ago

Link to documentation for kerchunk, the library into which this project's code would be added: https://fsspec.github.io/kerchunk/

I would expand the expected outcomes section:

A module would be added to the kerchunk codebase to scan and find the binary offsets within remote netCDF3 files. Documentation for the use of this new functionality would be produced, to appear in the kerchunk docs pages. Example datasets would be found and scanned, producing aggregated JSON files over many input data files. The process to produce the JSON files would be demonstrated in Jupyter notebooks, and separate Jupyter notebooks would demonstrate the speed and flexibility of using zarr to read the original data. The datasets would be added to the catalog of examples in the kerchunk project.
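For a sense of what those aggregated JSON files contain, here is an illustrative reference set in the fsspec ReferenceFileSystem layout (all names, offsets, and lengths below are made up): Zarr metadata keys hold inlined JSON, and each chunk key maps to a [url, offset, length] triple pointing into one of the original files.

```python
# Illustrative structure of a kerchunk reference set (all values made up).
# Zarr metadata keys hold inlined JSON; chunk keys point at byte ranges
# inside the original NetCDF files, so no data is copied or transcoded.
references = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temperature/.zarray": '{"chunks": [1, 180, 360], "dtype": "<f4", "shape": [2, 180, 360], ...}',
        "temperature/.zattrs": '{"_ARRAY_DIMENSIONS": ["time", "lat", "lon"], "units": "degC"}',
        "temperature/0.0.0": ["s3://example-bucket/file_2022-01-01.nc", 31424, 259200],
        "temperature/1.0.0": ["s3://example-bucket/file_2022-01-02.nc", 31424, 259200],
    },
}
```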

cgentemann commented 2 years ago

I would expand the paragraph on data a little from:

While new cloud-performant formats like Zarr have been created to represent the NetCDF Data Model, it has been shown that NetCDF files themselves can be made cloud-performant by creating a JSON file that makes a collection of NetCDF files readable by the Zarr library.

To:

NetCDF4 data can be stored and accessed on the cloud, but it isn't 'fast'. Cloud-performant formats are designed for performance on the cloud. While new cloud-performant formats like Zarr have been created to represent the NetCDF Data Model, it has been shown that NetCDF files themselves can be made cloud-performant by creating a JSON file that makes a collection of NetCDF files readable by the Zarr library.

martindurant commented 2 years ago

a JSON metadata file that makes a collection of netCDF files readable directly by the Zarr library. There is no need to transcode or duplicate the original data.

cgentemann commented 2 years ago

Yes!

7yl4r commented 2 years ago

@bbest: Maybe extractR is relevant here?

Not directly, but as an alternative approach built on ERDDAP.

martindurant commented 2 years ago

@7yl4r, can you please provide a link and/or short description of this?

7yl4r commented 2 years ago

https://github.com/marinebon/extract-app - The application serves as a middleman between ERDDAP and the user, exposing an even higher-level API for users looking to analyze spatiotemporal grids.

I see this thread is more performance-minded, but from what I have seen, ease of access to NetCDF data is limited more by technical capacity than by computational limitations. This mention is relevant to the title, but perhaps it should be broken out as a separate idea issue.

martindurant commented 2 years ago

Thanks for the link, @7yl4r, but I'm afraid I still don't follow. This proposal is both about providing efficient parallelised IO, AND logical datasets that view a large collection of input datasets as a single entity with coordinates over the various axes.

rsignell-usgs commented 2 years ago

@7yl4r, ERDDAP is designed to allow access to IOOS data stored on regular file systems -- this proposal targets cloud object storage, which allows direct extraction of the data using object storage services (without extra data services).

7yl4r commented 2 years ago

I am missing something here. Direct extraction of the data via an object store would mean serving files individually? Individual file access is easy, though. DAP provides an API for grabbing data that is less dependent on the files themselves, and extract-app aims to close the gap between DAP and user needs. Can you help me understand the use case for this project? Is the aim to create a more performant DAP alternative?

martindurant commented 2 years ago

There is no server in our architecture at all. All it requires is a previously generated small references sidecar file, which can be stored at a separate location to the main data.
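A minimal sketch of that, assuming a previously generated sidecar named references.json (the file name and anonymous-access options are illustrative): fsspec's reference filesystem turns the sidecar into a Zarr-compatible store, and every read becomes a byte-range request against the original files in object storage.

```python
import fsspec
import zarr

# Only the sidecar JSON is needed; every read becomes a byte-range request
# against the original files in object storage -- no data server involved.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",           # previously generated sidecar file
    remote_protocol="s3",
    remote_options={"anon": True},
)
store = fs.get_mapper("")           # behaves like a Zarr key/value store
group = zarr.open(store, mode="r")  # lazy view of the aggregated dataset
print(group.tree())
```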

We provide the following things, some of which may be covered by other tools:

7yl4r commented 2 years ago

Thank you for adding detail here. At a minimum I am at least understanding that I am over my head.

It sounds like a lot of value has been created, and it seems that I personally would benefit from documentation and examples built by a GSoC student. I will look forward to reading and reviewing the materials.

ocefpaf commented 2 years ago

@martindurant and @rsignell-usgs, are you getting candidates for this project? GSoC is seeing lower engagement this year due to its many changes. You may need to advertise it on Twitter and other venues to draw more attention to it.

rsignell-usgs commented 2 years ago

@ocefpaf, nope. It's definitely a cool project and I'll tweet about it. Use #GSoC22 right?

martindurant commented 2 years ago

@rsignell-usgs, if you put something out, I'll retweet and can get Anaconda and/or zarr to do so too. That should reach enough people.

cc @hdsingh, if you have any contacts that might be interested in working with us.

BjoernMayer92 commented 2 years ago

@rsignell-usgs, @martindurant

Hello, I am a PhD candidate in climate science from Germany. I am working on the interface between machine learning and climate science, trying to extract new insights by applying explainable AI methods to data from Earth system models. I am a trained physicist who switched to climate science and a mostly self-taught programmer. Therefore, my coding skills are not at the level of a trained software developer, but I am very enthusiastic about stepping up my coding skills and internalizing best practices. I would be interested in submitting a proposal if you think this might be a good idea given my current coding skills and experience.

Decenov commented 2 years ago

@rsignell-usgs, @martindurant Hello, hope it's not too late to reach out. I'm a 2nd-year Ph.D. student in physical oceanography from Hong Kong. I have a pretty similar background to the previous candidate, but my topic is related to ML and ocean models. I have worked with NetCDF and Zarr files from models and observations, and done data analysis using Xarray and Pandas, etc., since my Master's projects and internships.

With that experience, I can definitely handle the part about writing Jupyter notebooks and documentation to demonstrate features for new users. Although I have no CS background, I'm always interested in contributing to open source and learning more about cloud computing. I would be grateful if you could kindly advise me on the application proposal.

Thanks!

martindurant commented 2 years ago

@BjoernMayer92, @Decenov: thank you both for reaching out. As the description explains, there are a few options in this project, depending on how much coding you are comfortable with. I do encourage you to look at the kerchunk repo (and, less importantly, fsspec.implementations.reference) to get a feel for how complex the current code stack is. Obviously, the more you know about xarray and zarr the better, not to mention the specific file formats - but all of that can be learned.

As more scientifically leaning candidates, you may wish to focus on making reference sets of interesting data and building notebooks/blogs which demonstrate a) how that process happens and b) the speed and ease of use for real scientific analysis operations - on remote data with a cluster. This might even help your research in the long term :)

rsignell-usgs commented 2 years ago

@BjoernMayer92 and @Decenov: I agree with @martindurant that the sweet spot here might be setting up workflows to generate JSON using kerchunk for some high impact datasets, for example: https://registry.opendata.aws/ecmwf-era5/. Forecast dataset JSON files need to be updated regularly as new forecasts arrive. In addition to generating the JSON-generating workflows, you could also develop analysis and visualization notebooks which demonstrate the power of the kerchunk/referenceFileSystem/virtual-Zarr dataset approach!
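As a rough illustration of such a JSON-generating workflow (the era5-pds glob pattern, variable name, and time concatenation dimension below are assumptions; a real job would run on a schedule or be triggered as new objects arrive), a sketch that rescans matching NetCDF files and rewrites the aggregated JSON:

```python
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3", anon=True)

def update_references(pattern, out_json):
    """Rescan all files matching `pattern` and rewrite the aggregated JSON."""
    refs = []
    for path in sorted(fs.glob(pattern)):
        with fs.open(path, "rb") as f:
            refs.append(SingleHdf5ToZarr(f, f"s3://{path}").translate())
    combined = MultiZarrToZarr(
        refs,
        concat_dims=["time"],          # assumed aggregation dimension
        remote_protocol="s3",
        remote_options={"anon": True},
    ).translate()
    with open(out_json, "w") as out:
        json.dump(combined, out)

# Re-run whenever new files land (glob pattern is illustrative)
update_references(
    "era5-pds/2022/*/data/air_temperature_at_2_metres.nc",
    "era5_t2m.json",
)
```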

Decenov commented 2 years ago

@rsignell-usgs, @martindurant Thank you for the replies. I have submitted an informal proposal. I’ll update it by the deadline. I know it’s already the Easter holiday, but feel free to give me any feedback whenever you are available. Thanks

peterm790 commented 2 years ago

Hello, @rsignell-usgs and @martindurant,

Unfortunately, I am also somewhat late to the party here. I am a recent Master's graduate in atmospheric science with experience working with xarray, zarr, etc., and I am looking for opportunities to gain more CS experience.

I have had a read through the kerchunk repository and believe a useful project could be to interface the current operational GFS data (https://registry.opendata.aws/noaa-gfs-bdp-pds/) with kerchunk, and perhaps ultimately with an xarray data-tree. I have had an initial look and believe the current iteration of scan_grib does not correctly interface with the GFS GRIB data, similarly to GEFS in this issue: https://github.com/fsspec/kerchunk/issues/150. I am unsure how much of a challenge adapting this would be, but I am certainly happy to have a go at it, as presently accessing certain GFS variables through Python is a challenge, let alone in a cloud-performant manner.

I hope this is the sort of project you are looking for. I realise the deadline is fast approaching, so I will go ahead and submit an initial proposal and will be happy to take any feedback or to consider a different direction should you suggest one.

Thanks.

rsignell-usgs commented 2 years ago

@peterm790 yes, that's definitely an interesting project to pursue! By all means submit a proposal!

martindurant commented 2 years ago

Work will be happening in https://github.com/fsspec/GSoC-kechunk-2022 . We can close this issue now, or at the end of the project.

ocefpaf commented 1 year ago

Closing all past GSoC issues. Please open a new issue if you want to participate in GSoC23.