pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/

Read from remote file system #1062


gerritholl commented 4 years ago

Feature Request

Is your feature request related to a problem? Please describe.

Currently, if I want to process data in Pytroll that are not on (or mounted on) the local filesystem, I must first download them using other tools such as s3fs.

Describe the solution you'd like

It would be cool if Satpy could read directly from a remote file system, as pandas can, so that one could do

sc = Scene(["s3://noaa-goes16/ABI-L1b-RadC/2017/059/12/OR_ABI-L1b-RadC-M3C16_G16_s20170591257496_e20170591300281_c20170591300308.nc"])

or even do the same with wildcards in the path.

Describe any changes to existing user workflow

Not applicable.

Additional context

It's a nice-to-have, should not be prioritised too highly.

djhoese commented 4 years ago

This topic has been running around in my mind for at least the last two years. This is something I would like to improve, and I have actually applied for funding to work on it, specifically with regard to helping machine learning workflows in the cloud. Although this is a "nice to have", it will become (and for certain groups has already become) extremely important.

That said, there are some problems, but also some solutions, that need to be described.

Problem 1 - File formats

Reading from remote file systems is likely not something we will be able to do for all formats read by Satpy. The custom binary formats are typically memory-mapped (np.memmap) and require the file to be on a local disk. If we provide a utility for downloading a remote file, optionally storing it entirely in memory, that may be a simple enough workaround that the user doesn't notice. Again, this would only apply to the custom binary formats.
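One possible shape for such a utility, sketched with fsspec; here the file is spilled to a local temporary file so an np.memmap-based reader can open it. The bucket path is a placeholder and no such helper exists in Satpy yet:

```python
import tempfile

import fsspec

# Hypothetical sketch: fetch a remote file to local disk so that a reader
# relying on np.memmap can open it. The S3 path is a placeholder;
# anon=True requests anonymous (credential-free) access.
with fsspec.open("s3://some-bucket/some-granule.dat", mode="rb", anon=True) as remote_file:
    with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as local_file:
        local_file.write(remote_file.read())
        local_path = local_file.name  # hand this path to the binary-format reader
```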

Problem 2 - Paths as strings

Satpy assumes that every path to a file it gets is a string (or a Python Path object). This can be a problem for accessing remote file systems like S3-based storage, where you often need credentials to go along with that path. There are also some workflows we've demonstrated with Satpy that work best with glob, and globbing isn't always available or easy for remote storage solutions.
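For reference, s3fs does offer remote globbing through its filesystem object; a minimal sketch, with an illustrative pattern and anonymous access assumed:

```python
import s3fs

# Sketch: glob directly on S3. The pattern is illustrative; for private
# buckets, credentials would replace anon=True.
fs = s3fs.S3FileSystem(anon=True)
filenames = fs.glob("noaa-goes16/ABI-L1b-RadC/2017/059/12/*C16*.nc")
```

Note that the results are plain strings without credentials attached, which is exactly the mismatch described above.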

Ray of sunshine 1 - Pangeo

One of the Pangeo group's biggest goals is to get scientists to move their compute (where they run the code) to where the data lives (the cloud). This means providing interfaces (Jupyter Hub or the command line) to easily access these cloud resources, and providing libraries (xarray, zarr, dask, etc.) for taking full advantage of these resources in an efficient manner. The Pangeo group contributes to many open source projects dedicated to these technologies and these same goals. We try to work with them as closely as possible.

Ray of sunshine 2 - Intake and fsspec

The Intake library and the fsspec library are two attempts (which work together), among many, at making it easier to access catalogs of data through a consistent interface. One of the tasks I recently requested funding for was specifically to add Intake support to Satpy and vice versa. This would mean you could take an Intake catalog (perhaps one accessing S3 storage), give it to a Satpy Scene object, and it would "just work". The vice versa part means that we could also write Intake drivers that use Satpy to read some of these formats. While this doesn't provide much benefit for CF-compliant NetCDF files, it could mean a wider audience for the custom binary formats or any other format that doesn't come with all the metadata necessary to use it well.
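Purely as a sketch of the envisioned workflow: the catalog URL and entry name below are made up, and the Scene integration is the hoped-for behavior, not something that exists today:

```python
import intake

# Illustrative only: the catalog location and entry name are hypothetical.
catalog = intake.open_catalog("s3://some-bucket/catalog.yaml")
# scn = Scene(catalog.goes16_abi_l1b)  # envisioned future usage, not implemented
```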

Ray of sunshine 3 - OPeNDAP

OPeNDAP has existed for a long time and provides an interface for accessing NetCDF files in remote locations. It allows the client to request specific chunks of data instead of downloading everything at once. The problem is that it requires an OPeNDAP server to serve the data and handle these requests. A popular service in meteorological Python is Unidata's THREDDS, and such servers are easier to access when you use the siphon Python package. However, cloud providers don't make this interface available for the data stored on their systems, AFAIK. So this is a ray of sunshine and a problem at the same time.
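As a concrete example of the access pattern, xarray can open an OPeNDAP endpoint directly and only transfer the slices you ask for; the URL and variable name below are placeholders:

```python
import xarray as xr

# Sketch: the THREDDS/OPeNDAP URL and variable name are placeholders.
ds = xr.open_dataset("https://thredds.example.edu/thredds/dodsC/some/dataset")
subset = ds["some_variable"].isel(time=0)  # only this slice is transferred
```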

Ray of sunshine 4 - HDF5/NetCDF4 S3 support

The most recent release of HDF5 includes direct support for reading data from S3 storage. Theoretically this means NetCDF4 files should work too. And if the netCDF4-python library can read them, then xarray can read them, which means Satpy can read them. As mentioned above, credentials would be needed in some cases, and I'm not sure how that is handled.
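A sketch of what that direct access looks like through h5py, assuming an HDF5/h5py build with the read-only S3 ("ros3") driver enabled; the URL is a placeholder:

```python
import h5py

# Sketch assuming HDF5/h5py built with the read-only S3 ("ros3") driver;
# the URL is a placeholder, and credentialed access would need additional
# driver keyword arguments.
f = h5py.File(
    "https://some-bucket.s3.amazonaws.com/some-file.nc",
    mode="r",
    driver="ros3",
)
```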

On top of this, recent work by some USGS folks has shown that zarr and xarray can be modified to read HDF5 datasets from S3 just like a zarr dataset. This is another incremental step toward data being more accessible for everyone.

The Future

I hope to work on this in the future, but would like to wait until I have funding or know I won't get funding. I really need to stop donating my time... :wink:

gerritholl commented 4 years ago

Problem 2 has been solved by pandas, or rather by s3fs (which pandas uses); see https://stackoverflow.com/a/51777553/974555
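For example, with s3fs installed, pandas resolves "s3://" URLs transparently; recent pandas versions also accept storage_options for forwarding credentials and settings to s3fs. The path below is a placeholder:

```python
import pandas as pd

# pandas defers "s3://" URLs to s3fs; the path is a placeholder.
df = pd.read_csv("s3://some-bucket/some-file.csv", storage_options={"anon": True})
```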