gerritholl opened this issue 4 years ago
This topic has been running around in my mind for at least the last 2 years. This is something I would like to improve and have actually applied for funding to work on, specifically with regard to supporting machine learning workflows in the cloud. Although this is a "nice to have", it will become (and has already become) extremely important for certain groups.
That said, there are some problems, and also some solutions, that need to be described.
Problem 1: Reading from remote file systems is likely not something we will be able to do for all formats read by Satpy. The custom binary formats are typically memory mapped (np.memmap) and require the file to be on a local disk. If we provide a utility for downloading a remote file, optionally keeping it entirely in memory, then that may be a simple enough workaround that the user doesn't notice. Again, this would only apply to the custom binary formats.
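As a rough illustration, a minimal download helper built on fsspec might look like this (the helper name and URL are placeholders, not an existing Satpy utility):

```python
import fsspec

def fetch_to_local(remote_url, local_path):
    """Download a remote file so np.memmap-based readers can open it from local disk."""
    # fsspec dispatches on the URL scheme (s3://, gcs://, https://, ...)
    # and picks up credentials from the environment where applicable.
    with fsspec.open(remote_url, "rb") as remote, open(local_path, "wb") as local:
        local.write(remote.read())
    return local_path

local_file = fetch_to_local("s3://example-bucket/path/segment.dat", "/tmp/segment.dat")
```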
Problem 2: Satpy assumes that every path to a file it gets is a string (or a Python Path object). This can be a problem for accessing remote file systems like S3-based storage, where you often need some credentials to go along with that path. There are also some workflows we've demonstrated with Satpy that work best with glob, and globbing isn't always available or easy for remote storage solutions (see the sketch below).
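For comparison, fsspec-based libraries such as s3fs do offer globbing, but it requires constructing a filesystem object (with credentials) rather than passing a bare path string; a minimal sketch with a placeholder bucket:

```python
import s3fs

# anon=True works for public buckets; otherwise credentials are needed here
fs = s3fs.S3FileSystem(anon=True)
matching_files = fs.glob("example-bucket/data/2020/01/*.nc")
```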
One of the Pangeo group's biggest goals is to get scientists to move their compute (where they run their code) to where the data lives (the cloud). This means providing interfaces (Jupyter Hub or command line) for easily accessing these cloud resources, and providing libraries (xarray, zarr, dask, etc.) for taking full advantage of these resources in an efficient manner. The Pangeo group contributes to many open source projects dedicated to these technologies and these same goals. We try to work with them as closely as possible.
The Intake library and the fsspec library are two attempts (that work together) among many at making it easier to access catalogs of data through a consistent interface. One of the tasks I recently requested funding for was specifically to add Intake support to Satpy and vice versa. This would mean you could take an Intake catalog (perhaps pointing at S3 storage), give it to a Satpy Scene object, and it would "just work". The vice versa part means that we could also write Intake drivers that use Satpy to read some of these formats. While this doesn't provide much benefit for CF-compliant NetCDF files, it could mean a wider audience for the custom binary formats or any other format that doesn't come with all the metadata necessary to use it well.
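Purely as a sketch of the idea (none of this exists yet; the catalog URL, entry name, and Scene keyword are all hypothetical):

```python
import intake
from satpy import Scene

# Hypothetical: open a catalog describing data that lives on S3...
catalog = intake.open_catalog("s3://example-bucket/catalog.yml")

# ...and hand one of its entries to a Scene, which would resolve the
# remote files (and credentials) itself. This keyword does not exist yet.
scn = Scene(source=catalog["abi_l1b_conus"])
```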
OpenDAP has existed for a long time and provides an interface for accessing NetCDF files from remote locations. It allows the client to request specific chunks of data instead of downloading everything all at once. The problem is that it requires an OpenDAP server to serve the data and handle these requests. A popular service in meteorological Python is Unidata's THREDDS, and THREDDS servers are easier to access when you use the siphon Python package. However, cloud providers don't make this interface available for the data stored on their systems, AFAIK. So this is a ray of sunshine and a problem at the same time.
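For example, browsing a THREDDS catalog and opening a dataset remotely with siphon looks roughly like this (the catalog URL and dataset choice are illustrative):

```python
from siphon.catalog import TDSCatalog

# Point at a THREDDS catalog (URL is just an example)
catalog = TDSCatalog("https://thredds.example.org/thredds/catalog/satellite/catalog.xml")

# Open the first dataset via the server's remote access service;
# only the requested chunks of data are transferred
dataset = list(catalog.datasets.values())[0]
remote = dataset.remote_access()
```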
The most recent release of HDF5 includes direct support for reading data from S3 storage. Theoretically this means that NetCDF4 files should work too. It also means that if the NetCDF4-python library can read them, then xarray can read them, which means Satpy can read them. As mentioned above, credentials would be needed in some cases, and I'm not sure how this is handled.
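In h5py terms this is the "ros3" (read-only S3) driver; a minimal sketch, assuming an HDF5/h5py build compiled with S3 support (the URL is a placeholder):

```python
import h5py

# Requires HDF5 built with the read-only S3 virtual file driver
f = h5py.File(
    "https://example-bucket.s3.us-east-1.amazonaws.com/path/to/granule.nc",
    driver="ros3",
)
print(list(f.keys()))
```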
On top of this, some recent work by USGS folks has shown that zarr and xarray can be modified to read HDF5 datasets from S3 just like a zarr dataset. This is another incremental step toward data being more accessible for everyone.
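Relatedly, xarray can already read an HDF5/NetCDF4 file from S3 through a file-like object, which avoids downloading the whole file up front; a sketch with a placeholder bucket:

```python
import fsspec
import xarray as xr

# Open the remote file as a file-like object and let the h5netcdf engine read it
with fsspec.open("s3://example-bucket/path/granule.nc", mode="rb", anon=True) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```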
I hope to work on this in the future, but would like to wait until I have funding or know I won't get funding. I really need to stop donating my time... :wink:
Problem 2 has been solved by pandas, or rather by s3fs (which pandas uses); see https://stackoverflow.com/a/51777553/974555
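For reference, the pattern from that answer (bucket and object key are placeholders):

```python
import pandas as pd

# pandas defers to s3fs when given an s3:// URL
df = pd.read_csv("s3://example-bucket/path/data.csv")
```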
Feature Request
Is your feature request related to a problem? Please describe.
Currently, if I want to process data with Pytroll that are not on (or mounted to) the local filesystem, I must first download them using other tools such as s3fs.
Describe the solution you'd like
It would be cool if satpy could read directly from a remote file system, as pandas can, so that one could do something like the following (the bucket and reader name are illustrative; this API does not exist yet):
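```python
from satpy import Scene

scn = Scene(
    filenames=["s3://example-bucket/path/to/granule.nc"],
    reader="abi_l1b",
)
```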
or even including wildcards:
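```python
scn = Scene(
    filenames=["s3://example-bucket/path/to/*.nc"],
    reader="abi_l1b",
)
```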
Describe any changes to existing user workflow
Not applicable.
Additional context
It's a nice-to-have and should not be prioritised too highly.