Using s3 and new intake-xarray plugin

jsignell commented 5 years ago

[x] @jsignell: Add Intake command or other appropriate mechanism to get the Keras training data to that notebook.

jbednar commented 5 years ago

What are the pros/cons/difficulties/limitations in using intake to do the fetching and caching?

jsignell commented 5 years ago

> What are the pros/cons/difficulties/limitations in using intake to do the fetching and caching?

It wouldn't be hard, just doesn't necessarily seem like best practice to me.

Intake

How: Read a file as a dataset with intake and then use gv.from_xarray

Pros:

could use templating to return a specific class/number
could cache in intake cache

Cons:

intake cache gets broken easily and the best thing is to blow it away
looks like we are selling intake too hard
we would still have a call to intake for each file, so this isn't really something that intake excels at.
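
For reference, a minimal sketch of what the intake route might look like. This is not the code from this PR. It assumes the intake-xarray plugin is installed (so that `intake.open_rasterio` is registered); the URL template, bucket name, and both helper names are hypothetical. The templated path stands in for "return a specific class/number", and `load_rgb` shows the one-call-per-file pattern listed in the cons.

```python
# Hypothetical layout: one TIFF per land-use class and image number.
# The bucket and naming scheme are placeholders, not the real dataset.
URL_TEMPLATE = "s3://example-bucket/{landuse}/{landuse}{number:02d}.tif"


def build_urlpath(landuse, number, template=URL_TEMPLATE):
    """Fill the (hypothetical) template for one class and image number."""
    return template.format(landuse=landuse, number=number)


def load_rgb(landuse, number):
    """Read one remote file with intake, then wrap it for GeoViews."""
    # Imported lazily so build_urlpath works without these packages.
    import intake  # assumes the intake-xarray plugin is installed
    import geoviews as gv

    # One intake call per file, as noted in the cons above.
    source = intake.open_rasterio(build_urlpath(landuse, number), chunks={})
    da = source.to_dask()  # an xarray.DataArray
    return gv.from_xarray(da)
```

A notebook would then call something like `load_rgb("forest", 3)` once for every image it needs.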

Geoviews

Pros:

logic is already encapsulated in gv.RGB.load_tiff
cache as needed rather than all at the beginning - get up and running quickly

Cons:

using remote images takes considerably longer
is it weird to load files with a plotting library? Might be overly viz-focussed.

ebo commented 5 years ago

I am not sure I have something definitive to interject, but I would love to have the conversation.

I do not know the interaction of intake and xarray. Regardless, the functionality is necessary wherever it is actually implemented. So what I would propose, if I can be so bold, is to a) ask where the best place for this functionality is, and b) develop some examples of how to use it.

I cannot comment on the above, but it is germane to my work.

jbednar commented 5 years ago

From what I can see, gv.load_tiff doesn't do much, just da = xr.open_rasterio(filename); return from_xarray(da, crs, apply_transform, nan_nodata, **kwargs). It seems like gv.from_xarray is doing the heavy lifting with or without intake.

My main concerns are:

The logic in this PR appears to be switching to a solution that's many times slower if there's no local copy of the data, which I don't think is a good practice. I think it's a better "best practice" recommendation to use some form of caching, even though that does fill up the hard drive, because in practice anyone doing real work ends up re-running the same notebook over and over. I want our examples to be something we really think people should be doing in practice.
I don't like how much code this solution requires per notebook. In real life, people generate tons of notebooks all over the place, and I don't want to encourage copying any substantial blocks of code like that for every notebook that uses a given set of data. I want logic like that to be encapsulated somewhere (whether that's in intake, xarray, or some other suitable library) and referred to in a notebook. Notebooks should have invocations of code, declarations, single-use calculations, and sketches of new code that will later migrate to a library; they should not have distracting boilerplate sections of code that get copied between notebooks ad infinitum, each time with some minor variation.
From the descriptions above, I can't tell what functionality is missing from intake or xarray to avoid having to have this big block of logic in each notebook. E.g. why do we need to have a call to intake per file, instead of being able to get a collection of them? Is that some functionality that's missing from intake?
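
To make the first two concerns concrete, here is a rough sketch of the kind of encapsulated download-once helper that could live in a library instead of being pasted into every notebook. It is an illustration only, not an existing intake, xarray, or GeoViews API: the name `fetch_cached` and the default `data_cache` directory are hypothetical, and it assumes the files are reachable through a URL scheme that `urllib` understands (for example HTTP(S)), which plain `s3://` paths are not.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir="data_cache"):
    """Return a local path for `url`, downloading it only on first use."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Prefix the file name with a URL hash so same-named files do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    name = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    local_path = cache_dir / f"{digest}-{name}"

    if not local_path.exists():
        # Download to a temporary name so an interrupted transfer
        # never leaves a truncated file in the cache.
        tmp_path = local_path.with_name(local_path.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp_path.replace(local_path)

    return local_path
```

With a helper like this, the per-notebook code could shrink to a single call such as `gv.RGB.load_tiff(str(fetch_cached(url)))`, and re-running the notebook would reuse the local copy rather than fetching the remote image again.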

jsignell commented 5 years ago

Depends on https://github.com/ContinuumIO/intake-xarray/pull/25 and https://github.com/ContinuumIO/intake/pull/221

holoviz-topics / EarthML

Using s3 and new intake-xarray plugin #76
