pytroll / pytroll-examples

Collection of examples for pytroll satellite data processing

Feature request: Use satellite data stored at AWS #35

Open raybellwaves opened 4 years ago

raybellwaves commented 4 years ago

@cgentemann has an example of how to access GOES data via AWS https://github.com/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/Access_cloud_SST_data_examples.ipynb

This is also related to https://github.com/pytroll/satpy/issues/1287
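
For context, the access pattern in that notebook is essentially anonymous S3 reads via s3fs. A minimal sketch (the `ABI-L2-SSTF` prefix, date, and engine choice here are illustrative assumptions, not taken from the notebook):

```python
import s3fs
import xarray as xr

# Anonymous access to the public GOES-16 bucket; product prefix and date
# are illustrative placeholders
fs = s3fs.S3FileSystem(anon=True)
files = fs.glob("noaa-goes16/ABI-L2-SSTF/2020/210/00/*.nc")

# GOES ABI files are netCDF4/HDF5, so the h5netcdf engine can read the
# file-like object returned by fs.open()
ds = xr.open_dataset(fs.open(files[0]), engine="h5netcdf")
print(ds)
```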

cgentemann commented 4 years ago

Thanks! Two things:

1. The GOES netCDF file formats are a bit of a mess; we will be updating the code later today to clean it up a bit as we read it in. @pahdsn is working on this.
2. The netCDF files are slow -- accessing a day takes about 3 min. I'm hoping soon they will be in Zarr. The difference between accessing them in netCDF versus Zarr is a bit striking: https://nbviewer.jupyter.org/github/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/goes-cmp-netcdf-zarr.ipynb

If you go to https://github.com/oceanhackweek/ohw20-tutorials you can run the example yourself with the Binder link at the bottom.
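
The Zarr side of that comparison boils down to something like this (the store path below is a placeholder, not the actual location of the converted data):

```python
import s3fs
import xarray as xr

# Hypothetical Zarr mirror of the same data; with consolidated metadata,
# opening the store costs one object read instead of one request per file header
fs = s3fs.S3FileSystem(anon=True)
store = fs.get_mapper("some-bucket/goes16-sst.zarr")  # placeholder path
ds = xr.open_zarr(store, consolidated=True)
```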

djhoese commented 4 years ago

@cgentemann I'm rereading that goes-cmp-netcdf-zarr example. Any idea what the chunk size defaulted to for the netCDF files?

cgentemann commented 4 years ago

As far as I can tell, the original netCDF file has no internal chunking, i.e. each netCDF file is a single 1x5424x5424 chunk.
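
For anyone who wants to check this on their own files, netCDF4-python reports the on-disk chunking directly; a quick sketch (filename and variable name are placeholders):

```python
import netCDF4

# Variable.chunking() returns "contiguous" when the whole variable is stored
# as a single block, otherwise a list of chunk sizes per dimension
with netCDF4.Dataset("OR_ABI-L2-SSTF_sample.nc") as nc:
    print(nc.variables["SST"].chunking())
```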

djhoese commented 4 years ago

Ah, OK, so that matches the zarr dataset. I was curious whether the chunk size was playing a role in the timing at all. Looks like it is mostly just data access. Thanks.

cgentemann commented 4 years ago

Yes, I'm actually hoping someone might jump in here with an explanation. We deliberately didn't change the chunking, to keep the comparison as apples-to-apples as possible. The decrease in initial access time makes sense because all the metadata is now consolidated. The decrease in the analysis time I'm not sure I understand -- maybe it has something to do with Zarr's concurrent reads?
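
For reference, the metadata consolidation mentioned above is a single call in zarr; a sketch, assuming an already-written store (the path is a placeholder):

```python
import zarr

# One-time step after writing a Zarr store: copy every .zarray/.zattrs entry
# into a single .zmetadata key so readers fetch all metadata in one request
zarr.consolidate_metadata("goes16-sst.zarr")  # placeholder store path
```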

Also, I've generalized the read routine to read all the GOES AWS data (not just SST). I'll post a link in a day or two. No power here right now.

pnuu commented 4 years ago

Very interesting test!

What is the chunking of the zarr data? My guess is that the (possible) native chunking in the zarr version speeds up the processing as less data are downloaded for the sub-region cropped from the full data.

Could you also time how long the fs.glob() calls take for the NetCDF version? I've never used S3, but I have heard that these "filesystem" operations can be rather slow. Or are there other parts of get_geo_data() that cause most of the slowness? Timing shorter segments of that function would be very interesting to see where the real bottleneck is.
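
Something along these lines would isolate the listing cost (the bucket prefix is an assumption):

```python
import time

import s3fs

# Rough timing of the listing step alone, separate from any data reads
fs = s3fs.S3FileSystem(anon=True)
t0 = time.perf_counter()
files = fs.glob("noaa-goes16/ABI-L2-SSTF/2020/210/*/*.nc")
print(f"glob took {time.perf_counter() - t0:.1f} s for {len(files)} files")
```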

djhoese commented 4 years ago

> What is the chunking of the zarr data?

The chunk size is the same as the netcdf (1x5424x5424).

raybellwaves commented 4 years ago

I created an end-to-end example here: https://gist.github.com/raybellwaves/4dd2f1472468e9f67424b6a148e9ac18

It could be improved upon and added to the repo to supplement the other Himawari examples:

- https://github.com/pytroll/pytroll-examples/blob/master/satpy/HRIT%20AHI%2C%20Hurricane%20Trami.ipynb
- https://github.com/pytroll/pytroll-examples/blob/master/satpy/ahi_true_color_pyspectral.ipynb

Those could also be updated if their data is available on AWS.

The gist could be updated by making a directory, downloading the data, saving the figure, and then deleting the downloaded data.
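
A rough sketch of that tidy-up flow, assuming the public Himawari-8 bucket (the exact prefix and layout may differ from what the gist uses):

```python
import shutil
import tempfile

import s3fs

# Make a temp dir, download the granules, then clean up afterwards
fs = s3fs.S3FileSystem(anon=True)
tmpdir = tempfile.mkdtemp()
try:
    for remote in fs.glob("noaa-himawari8/AHI-L1b-FLDK/2020/09/01/0000/*"):
        fs.get(remote, f"{tmpdir}/{remote.rsplit('/', 1)[-1]}")
    # ... build the Scene from the local files and save the figure here ...
finally:
    shutil.rmtree(tmpdir)
```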

The next thing to test would be 'streaming' the data to avoid having to download it locally at all.

In addition, one thing I would be interested in -- it could slot onto the end of this example -- is how to save a true color image of the full disk at an e-mailable size (< 20 MB), e.g. there was chat in the Slack about using tiled=True when saving as a GeoTIFF (https://pytroll.slack.com/archives/C0LNH7LMB/p1599313293263100).
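
For the GeoTIFF size point, the tiled=True idea from Slack would look roughly like this (local file paths are hypothetical, and the final size will depend on the scene and compression settings):

```python
from glob import glob

from satpy import Scene

# Hypothetical local AHI files; tiled=True plus DEFLATE compression are passed
# through to rasterio as GeoTIFF creation options and shrink the output file
scn = Scene(reader="ahi_hsd", filenames=glob("/tmp/ahi/*.DAT"))
scn.load(["true_color"])
scn.save_dataset("true_color", filename="true_color.tif",
                 tiled=True, compress="DEFLATE")
```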

djhoese commented 4 years ago

@raybellwaves Very nice. A couple things:

  1. Recently @gerritholl added the ability to pass a file system object to satpy's find_files_and_readers. This may simplify or provide a different style of globbing for files on an S3 store.
  2. Recently the NetCDF C library was updated by Ryan May to allow for #mode=bytes on HTTP URLs so the library can do byte range requests. This works for S3 backends too. I haven't made the pull request yet but posted about it in the satpy channel on slack:
```diff
--- satpy/readers/yaml_reader.py    (revision 0de817e6d4599e971724affc9f719f9aebc41ff8)
+++ satpy/readers/yaml_reader.py    (date 1599314347246)
@@ -69,6 +69,9 @@
     """Get the end of *path* of same length as *pattern*."""
     # convert any `/` on Windows to `\\`
     path = os.path.normpath(path)
+    # remove possible #mode=bytes URL suffix to support HTTP byte range
+    # requests for NetCDF
+    path = path.split('#')[0]
     # A pattern can include directories
     tail_len = len(pattern.split(os.path.sep))
     return os.path.join(*str(path).split(os.path.sep)[-tail_len:])
```
```python
In [5]: url = "https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2019/001/00/OR_ABI-L1b-RadC-M3C14_G16_s20190010002187_e20190010004560_c20190010005009.nc#mode=bytes"

In [6]: scn = Scene(reader='abi_l1b', filenames=[url])

In [7]: scn.load(['C14'])
  proj_string = self.to_proj4()

In [8]: scn.show('C14')
Out[8]: <trollimage.xrimage.XRImage at 0x7f444e7651d0>
```

I'm not saying we can't incorporate your usage directly, but it might be nice, together with the rest of your suggestions, to include something like this where the files don't have to be downloaded to disk.
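
For completeness, point 1 above might look roughly like this (the bucket layout is illustrative, and the exact keyword may vary between satpy versions):

```python
import fsspec

from satpy.readers import find_files_and_readers

# Glob directly on S3 by handing the filesystem object to
# find_files_and_readers instead of listing local paths
fs = fsspec.filesystem("s3", anon=True)
files = find_files_and_readers(
    base_dir="noaa-goes16/ABI-L1b-RadC/2019/001/00",
    reader="abi_l1b",
    fs=fs,
)
```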