pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Coordinate better with NOAA BigData Project #234

Closed rabernat closed 6 years ago

rabernat commented 6 years ago

I recently read this interesting article on the Unidata news site: NEXRAD Archive data available on Amazon S3 (see also: Double the Data: Putting Weather Observations in the Cloud Increases Access).

the Big Data Project is an innovative approach to publishing NOAA's vast data resources and positioning them near cost-efficient high performance computing, analytic, and storage services provided by the private sector

This sounds extremely compatible with our goals in Pangeo; we are trying to provide a computational platform to make the most out of such cloud-based data stores. Using a pangeo cluster to process this data would be a perfect demonstration of the scientific potential of the cloud.

However, digging a little deeper, in a twitter exchange with @dopplershift, I learned that this data can't be read by xarray! https://twitter.com/dopplershift/status/991705085326626817

Based on @dopplershift's example notebook, it looks like the data is accessed via the siphon package as a siphon.cdmr.Dataset. (I can't find documentation on the cdmr module on the siphon docs.)

This seems like a missed opportunity! Considering that Unidata and NCAR employees are directly funded by the Pangeo award, I would hope we would be working together to make sure that the computational tools developed by Pangeo are compatible with these cloud data assets. Instead, the Unidata API for accessing the data is incompatible with our primary computational tools (xarray & dask).

Moving forward, how can we improve this situation? Some ideas:

mrocklin commented 6 years ago

It looks like they're hitting a THREDDS server, yes? Does that mean we're bottlenecked on a few machines on EC2 that Unidata is serving? If so, that might prevent scalable analysis; going to S3 directly might be nicer.

This document seems to suggest that this is possible. They have a standard hierarchical structure around a bunch of chunks of data. A couple notes:

  1. It's not clear how the bytes of an individual chunk are formatted, at least not to me (maybe there is some assumed knowledge in this domain that I'm not aware of).
  2. They seem to be in bz2 format, which, while decent for long-term archival, is very slow to decompress. I might recommend a more modern compression standard like Zstandard, which has compression ratios appropriate for long-term archival but is much, much faster; see the toy benchmark below.
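
A minimal sketch of that comparison on synthetic data, assuming the third-party zstandard package is installed; the payload is made up, so treat the numbers as illustrative only:

```python
import bz2
import time

import zstandard  # third-party: pip install zstandard

# Synthetic stand-in for one data chunk; real NEXRAD chunks will
# compress differently, so this only illustrates relative speed.
payload = b"radar chunk " * 200_000

bz2_blob = bz2.compress(payload)
zst_blob = zstandard.ZstdCompressor(level=19).compress(payload)

t0 = time.perf_counter()
bz2.decompress(bz2_blob)
t1 = time.perf_counter()
zstandard.ZstdDecompressor().decompress(zst_blob)
t2 = time.perf_counter()

print(f"bz2:  {len(bz2_blob):>8d} bytes, {t1 - t0:.4f}s to decompress")
print(f"zstd: {len(zst_blob):>8d} bytes, {t2 - t1:.4f}s to decompress")
```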

I'd be happy to have a technical chat if folks at Unidata are interested.

rsignell-usgs commented 6 years ago

I gave a presentation in the NOAA Big Data session last week at the NOAA Environmental Data Management meeting in Silver Spring (15 min recording). One of the points I made was that, so far in the NOAA Big Data project, the data on S3 might just as well be on an FTP site: we need to read it and then rewrite it to S3 in Zarr, HSDS, or some other more cloud-friendly format. These are points @mrocklin already made in his blog post on HDF in the Cloud, and I reference that post in the talk.
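
To make that read-then-rewrite workflow concrete, here is a minimal sketch using xarray, s3fs, and Zarr; the bucket, key, chunking, and h5netcdf engine are all assumptions for illustration, not details of the actual NOAA buckets:

```python
import s3fs
import xarray as xr

# Read a NetCDF file "parked" on S3 (hypothetical bucket and key;
# assumes the file is HDF5-based so the h5netcdf engine can read it).
fs = s3fs.S3FileSystem(anon=True)
with fs.open("hypothetical-bucket/model_output.nc", "rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf").load()

# Rewrite it as a chunked Zarr store on S3, which allows efficient
# concurrent, partial reads from cloud object storage (assumes the
# dataset has a "time" dimension to chunk over).
target = s3fs.S3Map("hypothetical-bucket/model_output.zarr",
                    s3=s3fs.S3FileSystem())
ds.chunk({"time": 100}).to_zarr(target)
```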

The THREDDS Data Server uses NetCDF-Java, and my understanding is that S3 access via NetCDF-Java currently works by downloading the entire NetCDF file from S3 before reading it.

Unidata has submitted a proposal to enhance their NetCDF-C library to read Zarr, and eventually plans to support cloud-optimized formats in NetCDF-Java as well.

rabernat commented 6 years ago

Unidata has submitted a proposal to enhance their NetCDF-C library to read Zarr, and eventually plans to support cloud-optimized formats in NetCDF-Java as well.

This sounds extremely promising. Who at Unidata is leading that effort? Can we coordinate with them to ensure compatibility with Pangeo efforts?

rsignell-usgs commented 6 years ago

@WardF and @DennisHeimbigner at Unidata wrote the proposal. @Ethanrd (Ethan Davis) is another Unidata contact for coordination.

DennisHeimbigner commented 6 years ago

To be precise: https://github.com/WardF is the PI and I am the co-PI on the proposal.

rsignell-usgs commented 6 years ago

Thanks @DennisHeimbigner for the correction. Sorry @WardF!

dopplershift commented 6 years ago

Let's, uh... take a breath, guys. While I appreciate the enthusiasm and gusto here, let me first add some background.

The NOAA Big Data project is a project exploring the migration of various NOAA data sets to many of the cloud vendors, including Google and AWS. AWS, based on customer/partner input, moved first with NEXRAD data, and the historical archive was migrated by CICS and NCEI; Unidata's role is to update the archive in realtime with full volume scans (5-minute collections of 3D volumetric data) as well as an S3 bucket with individual chunks for pseudo-realtime updates. More details are available in a recent publication.

The data are uploaded in a bz2-compressed chunk format, inside which is the radar system's native binary, message-based format. (For some of the older data, I believe it is gzipped files instead.) The choice of bz2 compression comes from the need for high compression ratios so that the realtime feed of data can be achieved over the network links available at the variety of radar sites across the country, some of which are very remote, with microwave line-of-sight connections. This bz2 compression was chosen around 2002, and at the time using it for the realtime transmission of NEXRAD data was an incredible advancement. (Before this there was no realtime transmission, only sending and requesting 8mm tapes.) Many, many, many tools are available to read this data, including Py-ART, so the minimal migration path was to upload the data directly from the archive into S3 and to continue growing the archive in the same format. This decision was jointly made by AWS and NCEI.

This native binary format is available through standard S3 API access, allowing download of individual volume-scan files (5-10 minutes of data). Since this is a pretty standard granularity for this data, I know people like @scollis have done some nice cloud-based things with this long-term collection. Unfortunately, this format does not have any native xarray support; I'm not sure if something like Py-ART has any plans to move this way...
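
As a concrete illustration of that native access path, here is a rough sketch using s3fs and Py-ART; the object key is an arbitrary example, and it assumes Py-ART's reader accepts a file-like object:

```python
import pyart  # Py-ART: pip install arm-pyart
import s3fs

# Anonymous access to the public NEXRAD Level 2 archive bucket on AWS.
fs = s3fs.S3FileSystem(anon=True)

# One volume scan; keys follow year/month/day/site/SITEyyyymmdd_hhmmss
# (this particular key is an arbitrary example, not a verified object).
key = "noaa-nexrad-level2/2016/06/01/KAMX/KAMX20160601_000252_V06"
with fs.open(key, "rb") as f:
    radar = pyart.io.read_nexrad_archive(f)

print(radar.fields.keys())
```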

So in addition to the native S3 access, Unidata runs a THREDDS server that is layered on top of the S3 access. THREDDS, dating back to 2001 as well, pretty much assumes it has a local file to work with and that random access is efficient. Therefore, its support for S3 is really simple and involves downloading the file locally from S3 and then providing its services on top of that. Now, one of these services is called CDMRemote (based on Google protobuf), which provides OPeNDAP-like (subset) access to datasets. I have written in Siphon a client for CDMRemote that plugs into XArray. Access in this fashion is currently not going to be any more efficient than XArray directly accessing the raw files in S3 (assuming support for the binary format were available in XArray). The impetus behind the THREDDS server is having support for our existing tools as well as more simplified date/time queries against the archive. Unidata is footing the bill for this server (including all data egress charges), so its access is limited to .edu domains.
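
For anyone curious what that looks like from Python, here is a rough sketch; the endpoint URL is a placeholder, and it assumes siphon.cdmr.Dataset is duck-type compatible enough with netCDF4.Dataset for xarray's NetCDF4DataStore to wrap it:

```python
import xarray as xr
from siphon.cdmr import Dataset
from xarray.backends import NetCDF4DataStore

# Placeholder CDMRemote endpoint; a real URL would come from browsing
# the THREDDS catalog (e.g. with siphon.catalog.TDSCatalog).
url = "http://thredds.example.edu/thredds/cdmremote/nexrad/some_scan"

# siphon.cdmr.Dataset mimics the netCDF4.Dataset interface, so
# (assuming the duck typing holds) xarray can wrap it directly.
ds = xr.open_dataset(NetCDF4DataStore(Dataset(url)))
print(ds)
```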

I should also add that there is minimal funding involved in the NOAA Big Data project (I know for a fact that Unidata has received $0), so as far as I'm aware no one has time/resources for a massive reformatting of a 25+ year archive of data. Plus there's a massive federal bureaucracy involved here on the NOAA side; I imagine you're talking a minimum of 3 years (a pilot project plus a lengthy migration period involving all partners) just to change the realtime feed from bz2 to Zstandard.

I’m happy to partner on doing interesting experiments, or even pursue funding for such efforts. I think we as a community can benefit greatly from improving how things are done in the cloud. But let’s slow down on the standard engineer line of thinking of “clearly this is a superior technical solution. Obviously the other people involved were not aware of this solution lest they would have chosen it. Let me correct this...”

rsignell-usgs commented 6 years ago

But let’s slow down on the standard engineer line of thinking of “clearly this is a superior technical solution. Obviously the other people involved were not aware of this solution lest they would have chosen it. Let me correct this...”

This may be the case for the NEXRAD data, but I know that for the National Water Model data, nobody is really using it yet, and Amazon is very interested in learning what N-dimensional format the community (researchers and developers) would find most useful. They know the big NetCDF files parked on S3 are not what the community wants, but they haven't been told what is more useful. This was confirmed last week during the NOAA Big Data session in Silver Spring. And that's partly why they funded our proposal to deploy Pangeo on AWS!

mrocklin commented 6 years ago

OK, thanks for the context @dopplershift . Brakes applied.

Perhaps the lesson here is that this community has some strong thoughts and some open questions on how ndarray data should be stored in the cloud to ensure efficient access from cloud-based computational frameworks. When we see people today not following these practices, we naturally speak up. Knowing that these decisions were made over a decade ago helps provide perspective (also, doing this a decade ago is really impressive! I would not have wanted to be tasked with this :))

It seems like the concrete thing to do here is to move ahead with investigating cloud-friendly ndarray formats, and eventually turn that knowledge into something publicly digestible that other groups can hopefully use to guide future efforts.

scollis commented 6 years ago

Cool. Just to note: I could write some simple code using Py-ART, running on an EC2 instance, that makes gridded data more X-Array amenable.

In addition, a very long-term desire is to change Py-ART's underlying data object from dictionaries containing numpy arrays to X-Array. How this will work, I'm still figuring out :) (in my head)
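
Roughly along these lines, perhaps; a sketch in which the input file, grid shape, and limits are arbitrary illustration values, and Grid.to_xarray() assumes a sufficiently recent Py-ART:

```python
import pyart

# Read one archived volume scan (a hypothetical local copy of an
# object pulled from the S3 archive).
radar = pyart.io.read_nexrad_archive("KAMX20160601_000252_V06")

# Map the radial data onto a regular Cartesian grid; the shape and
# limits here are arbitrary illustration values.
grid = pyart.map.grid_from_radars(
    radar,
    grid_shape=(11, 201, 201),
    grid_limits=((0, 10e3), (-150e3, 150e3), (-150e3, 150e3)),
)

# Hand the gridded data to xarray as a Dataset.
ds = grid.to_xarray()
print(ds)
```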

jhamman commented 6 years ago

dictionaries containing numpy arrays

Is another way to describe xarray Datasets. Please feel free to engage us xarray devs when you start work on this.
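
To illustrate how close the two models are, here is a small sketch mapping a made-up Py-ART-style field dict onto an xarray.Dataset:

```python
import numpy as np
import xarray as xr

# A Py-ART-style field: a numpy array plus metadata in a plain dict
# (the field contents here are made up for illustration).
reflectivity = {
    "data": np.zeros((360, 1000)),
    "units": "dBZ",
    "long_name": "equivalent_reflectivity_factor",
}

# The same content as an xarray.Dataset: named dimensions, with the
# metadata carried as attributes on the data variable.
ds = xr.Dataset(
    {
        "reflectivity": (
            ("azimuth", "range"),
            reflectivity["data"],
            {k: v for k, v in reflectivity.items() if k != "data"},
        )
    }
)
print(ds)
```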

scollis commented 6 years ago

Absolutely will do. Py-ART was born around a year before X-Array; if I designed it again, X-Array would be integral.

rabernat commented 6 years ago

@dopplershift: thanks for this detailed and useful reply!

But let’s slow down on the standard engineer line of thinking of “clearly this is a superior technical solution.

In my original post, I hope it is clear that I was not criticizing any of the technical choices made in the BigData project. Rather, I was lamenting the [currently] missed opportunity to connect these two cloud-centric projects happening at the same institution. And I was hoping to encourage discussion on how to get some synergy happening. (However, that discussion did quickly evolve in the direction you described.)

I have written in Siphon a client for CDMRemote that plugs into XArray.

This sounds cool and useful. In the example notebook, I saw multidimensional data variables with coordinates and attributes, so I naturally thought it would be straightforward to fit into xarray's data model. I hear your point that this would not necessarily be any more efficient than working directly with the numpy data. Within pangeo, we are trying to standardize around xarray as a lingua franca for data structures and computations. So even without massive gains in raw computational efficiency, there are possible productivity gains for users.

In general, I would love to talk more within Pangeo about siphon and its capabilities. It seems like a powerful technology that could solve some long-standing problems. Dialog is the goal! And it seems like we are achieving that here...so 👍.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.