holoviz-topics / EarthML

Tools for working with machine learning in earth science
https://earthml.holoviz.org
BSD 3-Clause "New" or "Revised" License
94 stars, 21 forks

Using s3 and new intake-xarray plugin #76

Closed jsignell closed 5 years ago

jsignell commented 5 years ago
jbednar commented 5 years ago

What are the pros/cons/difficulties/limitations in using intake to do the fetching and caching?

jsignell commented 5 years ago

> What are the pros/cons/difficulties/limitations in using intake to do the fetching and caching?

It wouldn't be hard; it just doesn't necessarily seem like best practice to me.

Intake

How: Read a file dataset with intake and then use gv.from_xarray

Pros:

Cons:

Geoviews

Pros:

Cons:

ebo commented 5 years ago

I am not sure I have something definitive to interject, but I would love to have the conversation.

I do not know how intake and xarray interact. Regardless, the functionality is necessary wherever it is actually implemented. So what I would propose, if I can be so bold, is to a) ask where the best place for this functionality is, and b) develop some examples of how to use it...

I cannot comment on the above, but it is germane to my work.

jbednar commented 5 years ago

From what I can see, gv.load_tiff doesn't do much; it just runs da = xr.open_rasterio(filename) ; return from_xarray(da, crs, apply_transform, nan_nodata, **kwargs). It seems like gv.from_xarray is doing the heavy lifting, with or without intake.

My main concerns are:

  1. The logic in this PR appears to be switching to a solution that's many times slower if there's no local copy of the data, which I don't think is a good practice. I think it's a better "best practice" recommendation to use some form of caching, even though that does fill up the hard drive, because in practice anyone doing real work ends up re-running the same notebook over and over. I want our examples to be something we really think people should be doing in practice.
  2. I don't like how much code this solution requires per notebook. In real life, people generate tons of notebooks all over the place, and I don't want to encourage copying any substantial blocks of code like that for every notebook that uses a given set of data. I want the logic like that to be encapsulated somewhere (whether that's in intake, xarray, or some other suitable library) and referred to in a notebook. Notebooks should have invocations of code, declarations, single-use calculations, and sketches of new code that will later migrate to a library; they should not have to have distracting boilerplate sections of code that get copied between notebooks ad infinitum, each time getting some minor variation.
  3. From the descriptions above, I can't tell what functionality is missing from intake or xarray to avoid having to have this big block of logic in each notebook. E.g. why do we need to have a call to intake per file, instead of being able to get a collection of them? Is that some functionality that's missing from intake?
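As an illustration of points 1 and 2, here is a minimal sketch of a download-once caching helper that could live in a shared module rather than being copied into each notebook. It uses only the Python standard library. The names `cached_path`, `fetch_cached`, `fetch_all`, and the `data_cache` directory are hypothetical; this is not intake's caching mechanism or any existing xarray/geoviews API, just one way the logic might be encapsulated.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def cached_path(url, cache_dir):
    """Return the local path where a copy of `url` is (or would be) stored."""
    # Keep the original file name for readability; fall back to "data".
    name = Path(urlparse(url).path).name or "data"
    # Prefix with a short hash of the full URL so that files with the same
    # name from different locations do not collide in the cache.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return Path(cache_dir) / f"{digest}-{name}"


def fetch_cached(url, cache_dir="data_cache"):
    """Download `url` on first use; later calls reuse the local copy."""
    target = cached_path(url, cache_dir)
    if target.exists():
        return target  # cache hit: no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete cached file.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target


def fetch_all(urls, cache_dir="data_cache"):
    """Fetch a whole collection of files with a single call."""
    return [fetch_cached(url, cache_dir) for url in urls]
```

A notebook would then need only one call such as `paths = fetch_all(urls)`, and each returned local path could be passed to a reader like the `gv.load_tiff` discussed above. The trade-off is the one already noted in point 1: the cache fills up disk space in exchange for fast re-runs.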

jsignell commented 5 years ago

Depends on https://github.com/ContinuumIO/intake-xarray/pull/25 and https://github.com/ContinuumIO/intake/pull/221