pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Pangeo and Google Earth Engine #216

Closed (jhamman closed this issue 5 years ago)

jhamman commented 6 years ago

While at EGU this week, I talked to @tylere about the potential of connecting the Google Earth Engine (GEE) API with Pangeo. Our conversation had a few takeaway points:

  • GEE is better positioned for GIS operations on big datasets
  • Pangeo is better positioned for working with multi-dimensional datasets
  • There is a connection point between those two paradigms

We discussed developing a use case that leverages GEE and Pangeo in concert with one another. @tylere - you showed me a notebook that uses the GEE API; perhaps we can start by getting something like that as a Pangeo example?

Long arc, it would be quite interesting to think about a deeper integration, perhaps one where we could extract datasets from GEE in the form of xarray/dask arrays.

mrocklin commented 6 years ago

This sounds like a fun interaction point. Is this something that XArray would consider adding as a direct API, or is this external?

jhamman commented 6 years ago

Is this something that XArray would consider adding as direct API or is this external?

I think it is too early to say. One idea we discussed was to add conversion functions to GEE or xarray (e.g. to_{xarray, gee} and from_{xarray, gee}). From my perspective, these would fit best in GEE or in a third-party package, but xarray does have analogs already, so it could fit there as well.
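For concreteness, a hypothetical sketch of what that API surface could look like (none of these functions exist today; the names and signatures are purely illustrative):

```python
# Purely hypothetical API surface -- none of these functions exist in the
# earthengine-api or xarray today; this only illustrates the shape of the idea.
import xarray as xr


def from_gee(image_collection, region, scale) -> xr.Dataset:
    """Materialize an Earth Engine ImageCollection as an xarray Dataset,
    e.g. with (time, y, x) dims and one variable per band."""
    ...


def to_gee(dataset: xr.Dataset, asset_id: str) -> None:
    """Register/upload an xarray Dataset so it is queryable as an
    Earth Engine asset."""
    ...
```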

The first step is probably for @tylere and me (or others with interest) to get a working example of the two packages working together. If that shows potential, we can come back to adding something to the most appropriate package.

guillaumeeb commented 6 years ago

Extending Pangeo to GIS data is something that is really appealing to me for @CNES needs. However, depending on GEE is a complicated debate. It would be more interesting for us to work with projects like https://github.com/opendatacube, as discussed in #96. Have you gotten any further with @andrewdhicks or @woodcockr?

tylere commented 6 years ago

To clarify Joe's original description, GEE's strength is in analyzing raster and vector data that is stored in arbitrary geographic projections (i.e. not a data cube). The majority of data in our public data catalog is remote sensing imagery from USGS, NOAA, NASA, & ESA, but we also have digital elevation models, vector boundaries, weather forecasts, climate projections, etc.

@guillaumeeb we did not discuss Pangeo depending on GEE; rather, we discussed demonstrating ways that Jupyter Notebooks and Widgets can be used to explore and analyze petabyte-scale geospatial backends, whether the data are stored in Pangeo or GEE (or somewhere else). Here is an example from AGU 2017 that demonstrates using Widgets with Earth Engine as a backend (slides, github).

I have a presentation accepted for JupyterCon that is scheduled before the Pangeo JupyterCon talk. One idea would be to think up a compelling use case that would demonstrate pulling processed satellite data from GEE and processed weather/climate data from Pangeo into a Jupyter Notebook.

jhamman commented 6 years ago

@guillaumeeb - your point is well taken, but it really offers me an opportunity to emphasize the beauty of the Pangeo ecosystem. There are no dependencies. We are working hard to create an ecosystem of tools that interoperate well together and that, where necessary, can be used interchangeably. A few examples that illustrate this concept well:

I think a key success of our effort to date has been the separation of concerns and the ability to interoperate.

Have you gotten any further with @andrewdhicks or @woodcockr?

No, but I'd really like to see them engage here. They seemed quite busy when we spoke last, but perhaps now is a better time to engage with them.


@tylere - thanks for providing some clarification. I think we're on the same page. Where do you think the best place to start is? You had mentioned developing some estimates of evapotranspiration using vegetation data from GEE and climate data from Pangeo. This seems like a nice target and something we could do now. Can you point to an example of pulling data out of GEE and working with numpy-like arrays? The first step will be to do that for something like leaf area index and convert it to an xarray DataArray.
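For illustration, a minimal sketch of that first step might look like the following (assuming an authenticated earthengine-api install; the MODIS LAI collection, region, and names below are illustrative choices, not anything settled in this thread):

```python
# A minimal sketch, not a working integration: pull a small Leaf Area Index
# patch out of Earth Engine as a numpy array and wrap it in an xarray DataArray.
# Assumes the earthengine-api, numpy, and xarray packages are installed and
# that this account has already authenticated with Earth Engine.
import ee
import numpy as np
import xarray as xr

ee.Initialize()

# One MODIS LAI image and a small lon/lat rectangle (both chosen arbitrarily).
collection = (
    ee.ImageCollection('MODIS/006/MCD15A3H')
    .select('Lai')
    .filterDate('2018-07-01', '2018-07-05')
)
image = ee.Image(collection.first())
region = ee.Geometry.Rectangle([-122.6, 37.0, -121.4, 38.0])

# sampleRectangle() returns the region's pixel values as a server-side object;
# getInfo() brings them back to the client as nested Python lists.
patch = image.sampleRectangle(region=region, defaultValue=0).get('Lai').getInfo()

lai = xr.DataArray(
    np.array(patch), dims=('y', 'x'), name='Lai',
    attrs={'source': 'MODIS/006/MCD15A3H via ee.Image.sampleRectangle'},
)
print(lai)
```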

woodcockr commented 6 years ago

@guillaumeeb @jhamman - Hi folks. Certainly we are interested in following up on the earlier conversation and in some better engagement. As @jhamman mentioned, we are a tad too busy to engage properly right now. I'm hoping we can create some clear air during May for further engagement.

tylere commented 6 years ago

@jhamman I think the next step should be additional brainstorming on use cases that could benefit from data drawn from both GEE and Pangeo. I think the evapotranspiration use case is important, but too complex for an initial demonstration. I'd like to think of some simpler examples that will be relevant to a wider audience.

Converting EE output to a dataframe is still a bit of a manual process, and it would be nice to make this easier in the future. FWIW, this notebook contains a function GetDataFrame() that demonstrates converting an Earth Engine response into a dataframe for a time series.
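As a rough illustration of the pattern such a helper follows (this is a generic sketch, not the notebook's GetDataFrame(); the collection, band, and point are placeholder choices):

```python
# A generic sketch of turning an Earth Engine getRegion() response into a
# pandas time-series DataFrame. This is not the notebook's GetDataFrame();
# the collection, band, and point below are placeholder choices.
import ee
import pandas as pd

ee.Initialize()

point = ee.Geometry.Point(-121.9, 37.4)
collection = (
    ee.ImageCollection('MODIS/006/MOD13Q1')   # 16-day NDVI composites
    .select('NDVI')
    .filterDate('2017-01-01', '2018-01-01')
)

# getRegion(geometry, scale_in_meters) returns a list of rows; the first row
# is the header ['id', 'longitude', 'latitude', 'time', <band names...>].
rows = collection.getRegion(point, 250).getInfo()

df = pd.DataFrame(rows[1:], columns=rows[0])
df['time'] = pd.to_datetime(df['time'], unit='ms')  # EE times are epoch milliseconds
df = df.set_index('time').sort_index()
print(df.head())
```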

mrocklin commented 6 years ago

One non-scientific use case is education. I can imagine integrating this into tutorials at conferences to help people get a better sense of using XArray and Dask together on remote datasets.

rabernat commented 6 years ago

Converting EE output to a dataframe is still a bit of a manual process

Just a technical clarification: although it shares some features with pandas, xarray (a central package for Pangeo) deals with multi-dimensional numeric arrays, not dataframes. The core underlying data of xarray objects are just numpy arrays (or their out-of-core counterpart, dask arrays).

I imagine that the raster data used by EE is similarly, at its core, multi-dimensional numeric arrays.

So perhaps mutual compatibility is not so far off. Xarray's rasterio example could be a good place to start.
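For reference, the rasterio-backed reading pattern that example covers looks roughly like this (the file path and chunk sizes are placeholders):

```python
# A small sketch of xarray's rasterio-backed reader. The path and chunk sizes
# are placeholders; passing chunks makes the result a lazy dask-backed array.
# (In newer xarray this reader has moved to the rioxarray package.)
import xarray as xr

da = xr.open_rasterio('path/to/some_geotiff.tif', chunks={'x': 1024, 'y': 1024})
print(da)                                       # DataArray with (band, y, x) dims
band_means = da.mean(dim=('x', 'y')).compute()  # data is only read at compute time
print(band_means.values)
```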

AlexHilson commented 6 years ago

I was just at the Earth Engine summit in Dublin and had some interesting conversations with some Googlers (thanks @tylere et al.).

There's an Earth Engine API in the works, with the data API in early access (including a GDAL interface). There are plans to make computation queries available via that API later as well. I've asked to be added to the early-access program; my gut feeling is it shouldn't be too hard to build dask / xarray / iris objects using that API.

They also have future plans to allow EE to use cloud-optimised GeoTIFFs directly from the GCP object store rather than requiring ingestion first. I can't imagine it will be feasible for us to convert our data into GeoTIFF, but I think it would be really cool to make atmospheric datasets more easily available through EE, so I'll have a play when that's available.

That's the main thing I took away as a mini project to work on, but I thought there were some interesting things to consider about the EE architecture too. The white paper is here.

  1. Data is stored as files on Google's internal Colossus storage system. Each file is a collection of 256x256 tiles (some datasets fit entirely in one file, some are split up).
  2. Metadata is stored in a separate database. Lazy objects are created by querying this database.
  3. The client libraries create a more abstract DAG, which is then expanded server-side (which is how they avoid moving huge graphs around).
  4. Users have quotas, but tasks are handled by a single scheduler system (which also means a shared task cache for all users, although they didn't have numbers for how much this helped).

Number 2 is a project I was interested in hacking on before the summit anyway, so hopefully we (the Informatics Lab) will get something similar running soon. I've had some early success directly building lazy objects from metadata rather than going through the xarray / iris loading process (blog post incoming). There are a lot of possible paths here: reading nc headers every time, Intake catalogues, a permanent metadata store.
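As a toy sketch of the "lazy objects built from a metadata store" idea (the metadata record and reader below are invented for illustration; a real version would pull chunks from object storage):

```python
# Toy illustration only: build a lazy dask array purely from a stored metadata
# record, deferring all reads until compute() is called. The metadata fields
# and the reader below are invented; a real version would fetch chunks from
# object storage instead of fabricating zeros.
import dask
import dask.array as da
import numpy as np

meta = {'shape': (744, 600, 800), 'dtype': 'float32'}   # e.g. (time, y, x)

@dask.delayed
def load_array():
    # Stand-in for a real reader that would pull bytes from object storage.
    return np.zeros(meta['shape'], dtype=meta['dtype'])

lazy = da.from_delayed(load_array(), shape=meta['shape'], dtype=meta['dtype'])
print(lazy)                        # nothing has been read yet
print(lazy[0].mean().compute())    # the "read" only happens here
```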

Number 4 is something I've long been curious about. I appreciate the security / stability reasons for our one-user-one-scheduler approach, but my gut feeling is that something smarter is needed to really make the most of every CPU second. I'd hoped that maybe we'd solve this with Kubernetes oversubscribing each CPU, but that doesn't seem to be feasible (at least with my limited understanding of Kubernetes 😄).

Now I'm getting way ahead of myself, but re: number 1, I have some interest at some point in hacking on IPFS to give us a globally unique (cloud-agnostic) global namespace (still using object stores as the storage back end).

woodcockr commented 6 years ago

Thanks for sharing this @AlexHilson, it's an interesting development and one of many underway regarding EO data in the Cloud. I'm currently Vice Chair of the CEOS Working Group on Information Systems and Services (WGISS). All the space agencies collecting and distributing EO data are currently grappling with how to better make it available in the Clouds for analytical use. It falls under a category of work called Future Data Access and Analytics Architectures (FDA for short; more info here).

The first report from this group resulted in an escalation of the FDA challenges to a broader cross-cutting concern across CEOS working groups. The more technology-related issues sit with WGISS: enhancing discovery and access mechanisms, identifying key architectural aspects, and sharing ideas and awareness of how different agencies are placing data in the Clouds (including with whom). More science-related aspects, like publishing Analysis Ready Data (ARD) so that different observations can be used more consistently together, or publishing provenance information, are covered in other CEOS working groups.

The goal here isn't to come up with one way to do things (that would kill innovation), but to have CEOS agencies contributing to the ecosystem in a way that helps us all do the things we want in a more manageable way.

Why do I mention this?

Because the low-level, behind-the-scenes aspects of xarray, dask, etc. are impacted by the choices made about storage and computational paradigm and, importantly, they do a useful job of hiding some of that complexity from the high-level APIs. This is why we use xarray in the Open Data Cube: it fits nicely with our original intent of having an nd-array paradigm for analysis regardless of computational platform. Separating metadata into a database also facilitates better handling of more complex aspects like provenance chains in analytics. This type of notion makes the ecosystem more manageable at higher levels for applications (albeit perhaps via a more complex ecosystem of services).

In CEOS FDA there are varying views on how much of this type of "hiding" can be done, its impact on the performance of applications, and where the boundaries of services are. For example, if one agency uses netCDF and another uses COG formats for custodial purposes in the same Cloud provider, how does this impact application development and frameworks like Pangeo? How do custodial metadata stores have to change in order to support this broader ecosystem (e.g. duplicates of point-of-truth datasets across multiple Clouds)?

CEOS WGISS sits at the boundary between the data providers (the CEOS agencies), who are placing data into Cloud environments along with associated services for discovery, and groups like Pangeo, who provide frameworks to do stuff with the data. I'd be interested in hearing more from the Pangeo community on how things might work better, and on the architecture, with both my WGISS and ODC hats on.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

darothen commented 5 years ago

After seeing some of the interesting snippets on Twitter surrounding Google Earth Engine and various demos at AGU 2018, I thought it might be a good time to re-open this conversation. Were folks from the Pangeo/GEE community able to cross-pollinate and discuss ideas at AGU? Are there new opportunities for collaboration that might be emerging?

mrocklin commented 5 years ago

@lheagy and other Jupyter folks had a conversation with GEE folks semi-recently I think. I don't recall the outcome there though.

jhamman commented 5 years ago

@scottyhq and @tylere had some conversations this week. Maybe they have something to report here.

tylere commented 5 years ago

At AGU I was able to attend (most of) the half-day Pangeo workshop, so I now have a better understanding of what types of analyses might fit well in the Pangeo stack. Congratulations on a well-run workshop... there was a high level of engagement from the audience, even though there was a diverse range of experience with Python/Jupyter.

Later in the week @scottyhq and I did get a chance to connect and briefly brainstormed on Pangeo / Earth Engine joint analysis ideas (although we got a bit sidetracked after discovering we are both part of the cryospheric community).

I still think it would be very useful to come up with a meaningful demo analysis that would be easier to accomplish using the combination of Pangeo and Earth Engine than by performing the same analysis on either system alone. The best idea I have come up with so far is around vegetation phenology where we could:

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

tylere commented 5 years ago

Reopening this issue to see if anyone from the Pangeo community is interested in 4 days of intensive Earth Engine training at the Geo for Good summit in September. I suspect any Pangeo-savvy developer who attends will develop ideas on how the Pangeo and Earth Engine communities could collaborate.

The application deadline is coming up... May 3.

Cheers, Tyler

---------- Forwarded message ----------
From: Google Earth Engine Developers google-earth-engine-developers@googlegroups.com
Date: Wed, Apr 17, 2019 at 3:19 PM
Subject: Apply now for the 2019 Earth Engine User Summit! (deadline May 3)
To: Google Earth Engine Developers google-earth-engine-developers@googlegroups.com

In case you are waiting for information on the next Earth Engine User Summit, note that for 2019: (Earth Engine User Summit) + (Geo for Good User Summit) = (Geo for Good Summit).

We are excited to announce that we are unifying the Earth Engine User Summit and Earth Outreach's Geo for Good User Summit into our first ever combined Geo for Good Summit. This Summit will bring together the Earth Engine and Earth Outreach communities in one larger event where scientists, nonprofits, and changemakers can learn from each other and potentially collaborate on projects for positive impact for our planet and its inhabitants.

This year’s Geo for Good Summit is a special four-day conference and hands-on technical workshop hosted by the Google Earth Outreach and Earth Engine teams at the Google headquarters in Sunnyvale, California from September 16-19, 2019. The Summit is intended for technologists, GIS specialists, remote sensing specialists and others working on projects using Google Earth, Earth Engine and other Geo tools to create new knowledge, raise awareness and enable action.

What: Geo for Good Summit
When: September 16-19, 2019
Where: Google Headquarters, Sunnyvale, California

To find out more and apply, visit g.co/earth/GeoforGood19. The deadline for applications is May 3, 2019.

Thank you, the Google Earth Engine and Google Earth Outreach teams

alxmrs commented 10 months ago

I just discovered this issue! Since the last post, a lot has changed. The EE team has improved their Python data APIs quite a bit. In my estimation, about a quarter of the proposed integrations are implemented: https://github.com/google/Xee exposes reading EE rasters as Xarray datasets.

FeatureCollection integration with Pangeo is the missing half. For this, I happen to be working on a design: https://docs.google.com/document/d/1Ltl6XrZ_uGD2J7OW1roUsxhpFxx5QvNeInjuONIkmL8/edit

TL;DR: EE FeatureCollections should be openable as Dask dataframes. I haven't given much thought to how to write Dask dataframes to EE, but it should be tractable. (For writing Xarrays to EE, this should be possible if EE can read Zarr, which will be easier once the GeoZarr spec is more stable.)
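For anyone landing here now, the Xee route looks roughly like the snippet below (adapted from memory of the Xee README; the collection, dates, and grid parameters are illustrative, so check the project docs for current usage):

```python
# Rough sketch of reading Earth Engine data into xarray through the Xee
# backend (pip install xee). The collection, date range, and grid parameters
# are illustrative; see https://github.com/google/Xee for current usage.
import ee
import xarray as xr

ee.Initialize()  # Xee's docs recommend the high-volume EE endpoint; omitted here for brevity

ic = ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY').filterDate('2020-01-01', '2020-02-01')

# Installing xee registers the "ee" backend with xarray.
ds = xr.open_dataset(ic, engine='ee', crs='EPSG:4326', scale=0.25)
print(ds)  # lazily loaded Dataset with time/lon/lat dims and one variable per band
```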