pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Connections with Open Data Cube? #96

Closed jhamman closed 5 years ago

jhamman commented 6 years ago

Do we have any Pangeo members that are working with the Open Data Cube project: https://www.opendatacube.org/?

Looking through their repositories, it seems like they are building out a very similar xarray/dask stack, with a focus on remote sensing datasets.

I will cross post on their main repository as well: https://github.com/opendatacube/datacube-core

hot007 commented 6 years ago

I used to be associated... I'll ping one of their devs :)

rabernat commented 6 years ago

I exchanged emails with Brian Killough of NASA after ESIP. He is involved in Open Data Cube via CEOS at a high level. He made some nice comments about our project and expressed a willingness to collaborate.

andrewdhicks commented 6 years ago

Hi @jhamman I'm one of the devs on Open Data Cube. We are big fans of xarray and dask.

Open Data Cube is basically a database of datasets that can return an xarray, with a bunch of algorithms and applications that are built around xarray. We have members from quite a few countries so we are keen to collaborate. Making it easier to merge climate and remote sensing datasets would be great.

(Thanks @hot007 for the ping)

mrocklin commented 6 years ago

Thank you for chiming in @andrewdhicks. I'm glad that you've been able to put XArray and Dask to some use. I have two questions from a dask perspective:

  1. Have you tried things on a distributed cluster? Do you have any reason to believe that this would be easier or harder than in vanilla XArray cases?
  2. What computational operations would you need in order to merge datasets efficiently and intuitively?

jhamman commented 6 years ago

xref on Open Data Cube github issues: https://github.com/opendatacube/datacube-core/issues/359

andrewdhicks commented 6 years ago

Hi @mrocklin, we tried to use distributed in its early days on HPC, but we found that it was too easy to trip the memory quota limits and have the job killed. Instead of dask.array, we have been using distributed as a futures backend to run jobs hand-crafted to known memory limits across multiple workers. Each job then uses vanilla xarray, but it is a pain to hand-craft a memory footprint for every algorithm, so I expect dask arrays to make life easier. With the new features in distributed to set memory limits and spill to disk, I'm really excited to try distributed again. I just need to do a little bit more work to get it running with the right memory settings on our PBS Pro cluster.
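The hand-crafting described above can be sketched roughly as follows. This is only an illustration, not ODC code: the helper names (`tile_rows_for_budget`, `split_rows`, `process_tile`), the raster shape, and the 64 MB budget are all hypothetical, and the standard-library `concurrent.futures` executor stands in for distributed's futures interface.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_rows_for_budget(n_cols, bytes_per_value, budget_bytes):
    """Return how many full rows fit in one task's memory budget (at least 1)."""
    row_bytes = n_cols * bytes_per_value
    return max(1, budget_bytes // row_bytes)


def split_rows(n_rows, rows_per_tile):
    """Split the row range into (start, stop) tiles of at most rows_per_tile rows."""
    return [(start, min(start + rows_per_tile, n_rows))
            for start in range(0, n_rows, rows_per_tile)]


def process_tile(bounds):
    """Hypothetical per-tile job; here it only reports how many rows it covered."""
    start, stop = bounds
    return stop - start


# Assumed numbers: a 40000 x 40000 float32 raster and a 64 MB budget per task.
n_rows, n_cols = 40_000, 40_000
rows_per_tile = tile_rows_for_budget(n_cols, bytes_per_value=4,
                                     budget_bytes=64 * 1024**2)
tiles = split_rows(n_rows, rows_per_tile)

# Each tile becomes one task. A distributed client offers similar submit/map
# methods, so the local executor here is only a stand-in for a real cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows_done = sum(pool.map(process_tile, tiles))
```

The tile size has to be recomputed by hand whenever an algorithm's working-set size changes, which is the burden that worker memory limits and spill-to-disk are expected to reduce.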

There are a couple of main operations we do:

As for merging datasets, a lot of it involves datasets of different spatial and temporal resolutions and doing interpolation between them. I'm not too involved in that side of things at the moment. There's generally some projection warping to get everything aligned first, so the first step is getting the rasterio reads working efficiently in the dask graph.
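As a rough illustration of the temporal part of that interpolation, here is a minimal pure-Python sketch. The function name `interp_linear` and all sample values are made up for the example, and it assumes the datasets have already been warped onto a common grid; a real workflow would use the array libraries rather than Python lists.

```python
from bisect import bisect_right


def interp_linear(src_times, src_values, dst_times):
    """Linearly interpolate src_values (sampled at sorted src_times) onto dst_times.

    Needs at least two source samples. Targets outside the source range get
    None rather than an extrapolated value.
    """
    out = []
    for t in dst_times:
        if t < src_times[0] or t > src_times[-1]:
            out.append(None)
            continue
        # Index of the right-hand neighbour, clamped so t == src_times[-1] works.
        hi = min(bisect_right(src_times, t), len(src_times) - 1)
        lo = hi - 1
        span = src_times[hi] - src_times[lo]
        weight = (t - src_times[lo]) / span
        out.append(src_values[lo] + weight * (src_values[hi] - src_values[lo]))
    return out


# Made-up example: a coarse series sampled every 16 days, resampled onto
# the time steps of a finer series (days since the first observation).
coarse_days = [0, 16, 32]
coarse_values = [0.25, 0.75, 0.5]
fine_days = [0, 4, 8, 16, 24, 40]
resampled = interp_linear(coarse_days, coarse_values, fine_days)
```

Returning None outside the source range is a deliberate choice in this sketch, so that gaps stay visible instead of being silently extrapolated.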

guillaumeeb commented 6 years ago

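Hi all, we tried Open Data Cube with Dask distributed on our PBS Pro based cluster at CNES at the end of last year, with some success and some lessons:

  • Distributed ingestion into the Data Cube format was really easy.
  • We had some memory problems with our Dask workers, but that was probably due to our lack of knowledge of Dask; I think we are fine now.
  • We have some difficulty knowing when to use the Data Cube API and when and how to use the Dask/Xarray ones, but again this is probably because we have not yet mastered these technologies.
  • We are now looking for more precise experimental use cases and operations that could leverage the distributed nature of Dask/Xarray and ODC.

At the beginning we wanted to develop Spark workloads with Open Data Cube, but when we found it was all based on Dask and XArray, we just decided to use those.

Open Data Cube is really promising for us as a distributed multidimensional data store for image analysis. As all of you have already observed, combining NetCDF with distributed storage and computing is still an open question, so we are trying to see if ODC can be an answer.

@andrewdhicks there are new resources for running Dask on PBS Pro: a huge amount of work by @mrocklin on pangeo (with dask-mpi), and we also worked on a simple PBS script available here: https://github.com/guillaumeeb/big-data-frameworks-on-pbs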

mrocklin commented 6 years ago

These docs may also be of use to people using Dask on HPC systems. They go through a number of lessons that we learned more generally: http://dask.pydata.org/en/latest/setup/hpc.html

On Thu, Feb 1, 2018 at 8:48 AM, Guillaume EB notifications@github.com wrote:

Hi all, we tried Open Data Cube with Dask distributed on our PBS Pro based cluster at CNES at the end of last year, with some success and some lessons:

  • Distributed ingestion into the Data Cube format was really easy.
  • We had some memory problems with our Dask workers, but that was probably due to our lack of knowledge of Dask; I think we are fine now.
  • We have some difficulty knowing when to use the Data Cube API and when and how to use the Dask/Xarray ones, but again this is probably because we have not yet mastered these technologies.
  • We are now looking for more precise experimental use cases and operations that could leverage the distributed nature of Dask/Xarray and ODC.

At the beginning we wanted to develop Spark workloads with Open Data Cube, but when we found it was all based on Dask and XArray, we just decided to use those.

Open Data Cube is really promising for us as a distributed multidimensional data store for image analysis. As all of you have already observed, combining NetCDF with distributed storage and computing is still an open question, so we are trying to see if ODC can be an answer.

@andrewdhicks there are new resources for running Dask on PBS Pro: a huge amount of work by @mrocklin on pangeo (with dask-mpi), and we also worked on a simple PBS script available here: https://github.com/guillaumeeb/big-data-frameworks-on-pbs


guillaumeeb commented 6 years ago

And I have only just seen the work you've done here on this pangeo repo with the Python wrapper to launch Dask on PBS! With adaptable capacity! I will try to follow this more closely. Don't hesitate to keep me in the loop on Dask and PBS!

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

rsignell-usgs commented 5 years ago

Following the discussion of ODC on this week's Pangeo call, @jmunroe reached out to Robert Woodcock @woodcockr at CSIRO, and I looped in Alex Leith @alexgleith at Geoscience Australia, who also recently wrote a nice blog post on recent ODC developments.

They have COG data on S3, and JupyterHub running on Kubernetes with Dask (I believe). So it seems there is a lot of opportunity for increasing connections between our communities. Both Robert and Alex would be willing to meet with us on our weekly Pangeo check-in, which we suggested they do after Ryan returns.

Robert also said this:

I’d also like to know if a representative from Pangeo might be willing to present at the next (long title coming) Committee on Earth Observation Satellites, Working Group on Information Systems and Services meeting (WGISS - Hanoi in October)? I’m the vice chair of WGISS and we are working towards updating the manner in which EO data is provided by major satellite data providers (USGS included). This is referred to as the Future Data Architecture initiative in CEOS and involves everything from Data discovery and supply, through to links into Cloud providers and analytical capabilities (of which ODC is one example, but not the only datacube technology). I believe many folks would be interested in the Pangeo ecosystem, intake and the complementary nature to the various data cube initiatives underway.

At a very practical level, I’d be interested if anyone in Pangeo knows why I’m getting assertion failures in dask distributed when running on an adaptive k8s cluster? The result appears correct but dask is complaining. I’m guessing the worker pods were killed before full clean up. I’m not alone with this error, judging by the issues on GitHub. Low level I know, so a tad off topic for this thread, but who knows, you may have hit it too. ;-)

Clearly we should reopen this issue!

alexgleith commented 5 years ago

Definitely keen to come to a Pangeo meeting. I was inspired to attempt to use Dask with the ODC after using some of the Pangeo demos. I think that the two projects are very close to each other, and we're talking currently about redesigning some aspects of the ODC database structure, and I wonder if talking to Pangeo will provide some context for us to focus on there too.

rsignell-usgs commented 5 years ago

@alexgleith, I think it would be cool to take one of the data intensive ODC workflows, like the creation of the average high-tide/low-tide composite, and see what that workflow looks like using just Dask/Xarray/Rasterio on Kubernetes (and using Pyviz for interactive visualization).

[Figure: the high-tide composite (left panel) and low-tide composite (right panel)]
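For readers unfamiliar with compositing, the core reduction is a per-pixel statistic over time. The sketch below shows a per-pixel median in plain Python with made-up values; the name `median_composite` is hypothetical, and any grouping of scenes by tide stage that the real workflow performs is not shown. With Xarray the same step would presumably be a reduction along the time dimension.

```python
from statistics import median


def median_composite(stack):
    """Per-pixel median over time for a list of equally sized 2-D grids.

    Pixels marked None (for example, masked cloud) are ignored; a pixel with
    no valid observation at any time step stays None.
    """
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    composite = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            valid = [grid[r][c] for grid in stack if grid[r][c] is not None]
            row.append(median(valid) if valid else None)
        composite.append(row)
    return composite


# Three made-up 2 x 2 observations of the same area; None marks a masked pixel.
stack = [
    [[10, 20], [None, 40]],
    [[12, None], [None, 44]],
    [[11, 26], [None, 42]],
]
composite = median_composite(stack)
```

Because every pixel is reduced independently, the work splits cleanly into spatial chunks, which is what makes this kind of workflow a natural fit for Dask.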

Is the data necessary for that calculation already in COG on S3?

cc @tonybutzer

alexgleith commented 5 years ago

Hey @rsignell-usgs, I was inspired by the Twitter thread that identified ships in Sentinel 1 data, and we (@caitlinadams and I) worked it into a notebook that uses Dask in a neat way, here: https://github.com/frontiersi/dea-sandbox-notebooks/blob/master/casestudy_shipping_dask.ipynb

I'll email you a link to our Sandbox where you can run it.

Regarding coastline extractions and high tide work, we have some Landsat data that can be used for it, and a notebook here: https://github.com/frontiersi/dea-sandbox-notebooks/blob/master/Coastal_erosion_monitoring.ipynb. This one doesn't use Dask... Geoscience Australia runs this for all of Australia on their supercomputer, and it uses a different kind of distributed workflow.

Email incoming.

rsignell-usgs commented 5 years ago

@alexgleith, yes, I was able to access and run both of those notebooks on your JupyterHub -- they are awesome. The coastal erosion one is very closely aligned with our work at USGS in the Coastal and Marine Program, so I'll share it around!

I had some trouble trying to find the bucket for the Landsat data ls8_nbar_scene being used, which I was interested in to see if we could hit that directly. Could you please provide the URL?

alexgleith commented 5 years ago

Hey Folks, how do I find my way into one of the Pangeo calls?

fmaussion commented 5 years ago

Hey Folks, how do I find my way into one of the Pangeo calls?

http://pangeo.io/meeting-notes.html#pangeo-weekly-meeting-notes

I'll try to join for my first meeting next Wed as well

woodcockr commented 5 years ago

Hey folks, I'll endeavour to get along to the next "Late Week" meeting, as that's at a somewhat more reasonable hour in my timezone.

alexgleith commented 5 years ago

I think I'm with @woodcockr and I'll aim for the meeting next week, which is at 6 am local time. 2 am does not sound awesome!

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.