pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io
693 stars 187 forks source link

Pangeo-friendly toolkit for processing digital topographic data #707

Closed benbovy closed 4 years ago

benbovy commented 4 years ago

Hi pangeo,

Is anyone here playing with Digital Elevation Models (DEMs) and interested in leveraging the pangeo ecosystem for things like:

I guess this might interest scientists in various fields of Earth Science? I'm eager to start using pangeo for interactive analysis and modelling of topographic data in distributed environments, but we still miss tools for the processing tasks listed above (or maybe I'm missing pangeo-friendly tools that already exist?).

I've been working (and I'm still working) on different bits that might be re-used in this context (e.g., here and here). The richdem library has potential too. Does anyone know other tools that we could re-use as a backend?

I'd like to eventually put everything together nicely in a domain-specific package, which would for example provide a xarray.DataArray extension (accessor).

It would be great if everything could work with dask arrays too. For some of the algorithms (e.g., terrain derivatives) this is quite trivial, but for anything based on flow routing (graph-traversal) I find it much harder! Now that xarray and dask play nicely with pydata/sparse, things might get a little bit easier. I'm still struggling to implement it, though (see, e.g., this notebook -- if someone has a better idea on how to solve this, I'd be happy to discuss!).

I'd be happy to discuss here or elsewhere with those who are interested in this topic!

rabernat commented 4 years ago

I know that @jjspergel and @jkingslake are looking at some of these issues over in https://github.com/rabernat/pangeo-rema/. They want to use richdem with COGs and do parallel processing with dask.

dshean commented 4 years ago

Great to see a discussion on this topic!

We do a lot of work with time series of high-res DEMs, and are migrating some analysis to pangeo framework. Our application is more for elevation/volume change analysis for dynamic landscapes (e.g., glaciers, volcanoes, landslides), typically from temporally and spatially sparse observations (1000s of nodata-containing raster DEMs over several years, each with relatively small footprint) across a large spatial domain. This is the same situation for REMA/ArcticDEM strips. There were some discussions and sprints related to this topic at the recent pangeo meeting - looping in @scottyhq @friedrichknuth @ShashankBice @jomey.

Many of our solutions in the past relied on large GDAL vrt for lots of tiled COG DEM datasets. One can do a lot with the GDAL API/CLI and utilities like GNU Parallel, especially if processing large batches of individual DEMs, or parallelizing raster time series analysis of small areas (e.g., custom stacks of DEMs for 100K glacier polygons). But as you mention, these approaches are not necessarily well-suited for complex spatial operations over huge arrays (e.g., global, tiled DEM mosaic/composite like REMA/ArcticDEM mosaic, SRTM or TanDEM-X) while leveraging distributed resources. So we're very interested in this topic for future analysis.

Also worth mentioning another toolkit that we use for some DEM processing tasks - the Ames Stereo Pipeline package (https://github.com/NeoGeographyToolkit/StereoPipeline). Primary application is for stereo reconstruction, but there are also several useful CLI utilities with multi-threaded, memory-efficient tilling/caching for operations on 1000s of rasters and/or huge rasters. Check out dem_mosaic and image_calc, for example.

dshean commented 4 years ago

Just learned that ArcticDEM 2-m strips are now available through Google Earth Engine catalog: https://developers.google.com/earth-engine/datasets/catalog/UMN_PGC_ArcticDEM_V3_2m

These are v3, though official release is now up to v7. Should primarily be a difference in coverage, not processing workflow. See release section of https://www.pgc.umn.edu/data/arcticdem/

REMA strips for Antarctica are not yet available in GEE. The tiled mosaics are not yet available for either ArcticDEM or REMA.

dshean commented 4 years ago

Looks like the pangeo-rema notebooks are currently exploring a single tile from the REMA mosaic.

So it looks like we are still at the stage where we need to slurp, process, and store all of the PGC products, converting to COG or some other cloud-friendly format? Or has someone already done this?

rsignell-usgs commented 4 years ago

There are some tools also in the datashader package http://datashader.org/user_guide/8_Geography.html

Here's an notebook example of hillshade: https://nbviewer.jupyter.org/gist/rsignell-usgs/0f96bb9c0ca34a5dd0fc8131a7bbae1c 2019-09-04_15-54-10

rabernat commented 4 years ago

So it looks like we are still at the stage where we need to slurp, process, and store all of the PGC products, converting to COG or some other cloud-friendly format? Or has someone already done this?

This is exactly what @jjspergel and @jkingslake are supposedly doing. The idea was to demonstrate the value of "analysis ready data" for this field. I will keep tagging them until they surface. ;)

Just learned that ArcticDEM 2-m strips are now available through Google Earth Engine catalog:

We have discussed several times where we can directly access GEE assets from Pangeo. Not sure where that stands.

jkingslake commented 4 years ago

Hi all, This is great to see people are working on this (I am a beginner with Pangeo and all the packages, so have a lot to learn!).

@jjspergel has successfully converted all the REMA tiles to COGs and they are in a google bucket here.

Now we need to produce the correct json files to accompany them.

I know that @jjspergel has been using GEE for other things (Sentinel SAR), but i didn't realize ArcticDEM strips were in there. Do we know if there are plans to get REMA in there too?

benbovy commented 4 years ago

Great to read all your comments on both the processing and data management aspects of this problem!

I'm very interested in how you deal with data formats, conversion and storage. So far I've not been very concerned by this as I'm dealing mostly with data generated from landscape evolution models, but we'll soon need efficient solutions in this field too, e.g., for model inference from actual data using advanced (ML/DL) techniques.

There are some tools also in the datashader package

Good to know! I wasn't aware of that (I haven't used datashader much yet). I was also considering using numba for implementing the computationally expensive parts. Not sure how best this could be reused, though. A xarray-topotools -> datashader dependency would a-priori make more sense than a datashader -> xarray-topotools dependency, IMHO.

processing large batches of individual DEMs, or parallelizing raster time series analysis of small areas

This actually represents much of my current use-cases (i.e., post-processing batches of landscape evolution simulations). I think that a toolkit based on xarray/dask should handle this case efficiently. For some processing tasks (e.g., drainage area), we might need to maintain two implementations (one serial and efficient used with map_blocks vs. one highly parallel but less efficient) that we could internally choose based on how the array is chunked.

rsignell-usgs commented 4 years ago

@benbovy , if you generate some cool examples, please report back, as USGS is very interested in this type of work also!

benbovy commented 4 years ago

@rsignell-usgs , I will!

dshean commented 4 years ago

Great to see COG REMA tiles already living in a google bucket! Thanks for leading that effort @jjspergel and @jkingslake. Once the process is streamlined, it would be great to repeat for latest version of the ArcticDEM tiles. I started a conversation with PGC last week on whether they would be willing to export and host the 2-m strips as COG.

@benbovy - sounds great. It sounds like you're using model output that is continuous in space and time, which is well-suited for xarray. Things get more complicated with large collections of DEMs derived from satellite obs that have small footprints with variable spatial extent and lots of nodata.

@scottyhq has some working solutions for satellite images like this, resampling to prepare nD time series on a limited, common grid. This works for small grid extent, but I'm not sure if others have scaled to larger domains (like full Antarctic continent). Doing this will result in very large, very sparse arrays in space and time. With dask and lazy evaluation, this should be tractable, and I/O is likely going to be bottleneck reading in 1000s of relatively small files. Not sure about latest developments on these subjects, and unfortunately, I haven't had time to test myself.

davidbrochart commented 4 years ago

If that can be of any help, I've been working on these things (although it doesn't really take advantage of Pangeo):

jjspergel commented 4 years ago

The REMA DEM mosaic and strips are now available through Google Earth Engine. @Jkingslake and I might continue to work on uploading it ourselves, but I'm going to play around with REMA in the meantime in GEE to see if what I'm interested in doing can be done through there.

On Thu, Sep 12, 2019 at 1:08 AM David Brochart notifications@github.com wrote:

If that can be of any help, I've been working on these things (although it doesn't really take advantage of Pangeo):

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/pangeo/issues/707?email_source=notifications&email_token=AKH3SEOCH6QRTBBL6UJG3KTQJHFETA5CNFSM4ISO32O2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6QVKLQ#issuecomment-530666798, or mute the thread https://github.com/notifications/unsubscribe-auth/AKH3SEOROWCO3WK3DGETIDLQJHFETANCNFSM4ISO32OQ .

-- Julian J. Spergel Lamont Doherty Earth Observatory University of Chicago '16 Geophysical Sciences

rabernat commented 4 years ago

I am still interested in an open-source-based approach to this problem. GEE is powerful but also limited in how it interoperates with other software. If GEE could solve all our problems, we would not have had to start Pangeo in the first place.

For anyone else interested in working on REMA from Pangeo, I have a very simple example binder which shows how to load up and process the data with xarray and dask: https://github.com/pangeo-data/pangeo-geospatial-examples

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

jbednar commented 4 years ago

@brendancol is interested in taking the geo toolkit from Datashader and making it into its own separate package for raster-based Numba-optimized computations on Xarray objects. It would still work nicely with Datashader, but such a project could then focus more directly on geographic applications, and would be a natural place to implement flow accumulation and other hydrology-related functions. I've agreed that he can move those items out of Datashader, but I'm not sure when he'll get time to do so. That said, I believe he's an independent contractor and so anyone who has a bit of money to throw his way could make him have time! :-)

brendancol commented 4 years ago

As @jbednar mentioned, I'm currently putting together a raster-based spatial analysis package which currently lacks a name (coming soon...). My thought was something like xarray-spatial or datashader-spatial. But would love input / feedback.

The mission of the library is: to provide numba + xarray-based implementations of common spatial analysis tools which are:

The first release will include the following analysis tools which take / return xarray.DataArrays (aka datashader aggregates):

I would love to add a hydrology tool to the short-term roadmap.

Other near term roadmap items include:

djhoese commented 4 years ago

Hi @brendancol, I was the guy on the Pangeo call today that mentioned the geoxarray project and how it exists but doesn't have anything filled in right now. You mentioned that some of these things might not be "geo" specific, but maybe that is OK depending on how these are implemented. If you'd like we can talk more on the pangeo gitter or create an issue in the geoxarray repository: https://github.com/geoxarray/geoxarray

rabernat commented 4 years ago

I personally would vote to keep the geoxarray name and repo for its originally intended purpose, which was mostly related to things like indexing, projections, etc. There is a lot of past discussion that refers to it. It would be confusing to repurpose it for some other, unrelated tool.

jbednar commented 4 years ago

xrspatial, then?

brendancol commented 4 years ago

@rabernat @jbednar I went with xarray-spatial but with the actual package being named xrspatial. I'm stubbing out things here and am preparing for a super-alpha release early next week: https://github.com/makepath/xarray-spatial

jbednar commented 4 years ago

Sounds good. Let's have a migration plan in place (i.e. a PR on datashader putting in appropriate stubs, etc.) before announcing it.

jbednar commented 4 years ago

See https://github.com/makepath/xarray-spatial/issues/17 for additional functions proposed for xarray-spatial.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.