ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)
Creative Commons Zero v1.0 Universal

SciPy 2016 meeting #22

Open ChrisBarker-NOAA opened 8 years ago

ChrisBarker-NOAA commented 8 years ago

Hi folks,

A few of us got together to talk about where to take all of this. We didn't take good notes, but I thought I'd capture the gist of where we think we should head.

Attending: @chrisbarker-NOAA @hetland @jay-hennen @rsignell-usgs

Conclusions:

We should build a single high-level object (that's the API of APIRUS) that users can query through one API regardless of the underlying grid representation. This way a user can point it to a netCDF file or OPeNDAP URL, then see what variables are there, and do things like ask for interpolated data at a point in space and time, get transects, etc.

Some of the core functionality is built into pyugrid and pysgrid, but it's not well organised at the moment. So, this Object (what to call it?) will be built up via composition, so we can share code that's the same for different grids, while being able to plug in new grid types, new interpolation algorithms, etc.
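A minimal sketch of what "built up via composition" could look like, assuming a hypothetical `Grid` interface with a single `locate` method. None of these names (`Grid`, `RegularGrid1D`, `NearestNodeGrid`, `GridDataset`) come from pyugrid or pysgrid; they are toy stand-ins to show that the top-level object never needs to know which grid type it holds.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Sequence, Tuple

Weights = Tuple[Sequence[int], Sequence[float]]


class Grid(Protocol):
    """Hypothetical interface that each pluggable grid type would implement."""

    def locate(self, lon: float, lat: float) -> Weights:
        """Return (node indices, weights) for a point."""
        ...


@dataclass
class RegularGrid1D:
    """Toy regular grid: evenly spaced nodes along lon only (lat is ignored)."""
    lon0: float
    dlon: float
    n: int

    def locate(self, lon: float, lat: float) -> Weights:
        pos = (lon - self.lon0) / self.dlon
        i = min(max(int(pos), 0), self.n - 2)   # clamp to a valid cell
        frac = min(max(pos - i, 0.0), 1.0)      # clamp to the cell's extent
        return (i, i + 1), (1.0 - frac, frac)


@dataclass
class NearestNodeGrid:
    """Toy unstructured grid: arbitrary node positions, nearest-node lookup."""
    nodes: Sequence[Tuple[float, float]]

    def locate(self, lon: float, lat: float) -> Weights:
        dists = [(lon - x) ** 2 + (lat - y) ** 2 for x, y in self.nodes]
        return (dists.index(min(dists)),), (1.0,)


@dataclass
class GridDataset:
    """Top-level object: composed of a grid plus named data arrays."""
    grid: Grid
    variables: Dict[str, Sequence[float]]

    def interpolate(self, point: Tuple[float, float], name: str) -> float:
        indices, weights = self.grid.locate(*point)
        data = self.variables[name]
        return sum(w * data[i] for i, w in zip(indices, weights))


salinity = [30.0, 32.0, 34.0]
regular = GridDataset(RegularGrid1D(0.0, 1.0, 3), {"salinity": salinity})
scattered = GridDataset(
    NearestNodeGrid([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]), {"salinity": salinity}
)
regular_value = regular.interpolate((0.5, 0.0), "salinity")
scattered_value = scattered.interpolate((0.9, 0.8), "salinity")
```

The calling code is identical for both datasets; only the composed grid object differs.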

For the most part, at least in coastal and ocean modeling, the various horizontal grid types use the same schemes for time and vertical coordinates. So a complete composited object will have:

This will facilitate interpolation in the following manner:

the interpolator can then slice the needed data out of the variable(s) to be interpolated, and apply the various weights to get the interpolated value.
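One way to picture the "slice, then apply the weights" step is sketched below. It assumes (this is not stated in the thread) that each coordinate axis, e.g. time and horizontal position, independently yields a pair of index and weight tuples; `apply_weights` is a hypothetical helper name.

```python
from itertools import product


def apply_weights(data, axis_weights):
    """Combine per-axis (indices, weights) pairs into one interpolated value.

    data         -- nested sequences, indexed as data[i0][i1]...
    axis_weights -- one (indices, weights) pair per axis, in axis order
    """
    total = 0.0
    per_axis = [list(zip(indices, weights)) for indices, weights in axis_weights]
    for combo in product(*per_axis):
        value = data
        weight = 1.0
        for index, w in combo:
            value = value[index]   # slice out only the needed element
            weight *= w            # weights multiply across axes
        total += weight * value
    return total


# data[time][node]: two time steps, three nodes
data = [[1.0, 2.0, 3.0],
        [5.0, 6.0, 7.0]]
time_part = ((0, 1), (0.25, 0.75))   # between time steps 0 and 1
node_part = ((1, 2), (0.5, 0.5))     # halfway between nodes 1 and 2
result = apply_weights(data, [time_part, node_part])
```

Only the four bracketing values are read from `data`, which matters when the underlying array is a lazily loaded remote variable.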

To be decided:

I think the top-level object should essentially represent a full dataset -- what would be in a complete netcdf file. In fact, one could use it to load a non-standards-compliant dataset and re-save it in a standards-compliant way, or create a subset and save it out fully consistent and compliant -- that sort of thing.

But is the top-level object the key object the user works with, or does the user work with individual variable objects? i.e.:

my_dataset.interpolate( (lon, lat, depth, time), 'salinity')

or

salinity = my_dataset.variables['salinity']
salinity.interpolate((lon, lat, depth, time), 'salinity')

I'm on the fence about this -- as you may well want multiple variables handled, it makes some sense to access it all via the data_set object and be able to ask for multiple variables at once.

On the other hand, you may have different variables in different datasets, and it would be nice to have an API where you wouldn't need to know whether those variables are in the same dataset or not.
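The two styles need not be mutually exclusive. Below is a rough sketch, with hypothetical `Variable` and `Dataset` classes, in which the dataset-level call simply delegates to the variable objects. In this sketch the variable-level call takes only the point, and a plain callable stands in for the real grid, data and interpolator.

```python
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[float, float, float, float]  # (lon, lat, depth, time)


class Variable:
    """Hypothetical variable object; `field` stands in for grid + data."""

    def __init__(self, name: str, field: Callable[[Point], float]):
        self.name = name
        self._field = field

    def interpolate(self, point: Point) -> float:
        return self._field(point)


class Dataset:
    """Hypothetical top-level object holding named variables."""

    def __init__(self, variables: Iterable[Variable]):
        self.variables: Dict[str, Variable] = {v.name: v for v in variables}

    def interpolate(self, point: Point, *names: str) -> Dict[str, float]:
        # Dataset-level access: several variables in one call.
        return {n: self.variables[n].interpolate(point) for n in names}


# Made-up fields that depend only on depth (index 2 of the point).
ocean = Dataset([
    Variable("salinity", lambda p: 30.0 + 0.5 * p[2]),
    Variable("temperature", lambda p: 20.0 - 0.5 * p[2]),
])
point = (-88.0, 29.0, 10.0, 0.0)

both = ocean.interpolate(point, "salinity", "temperature")
single = ocean.variables["salinity"].interpolate(point)
```

Because the work lives on the variable objects, code that receives a `Variable` does not need to know which dataset it came from.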

what to call it?

My first thought: star_grid (i.e. *grid) -- but that's likely to make folks think it's an astronomy project.

any_grid maybe? (or py_anygrid) -- separate issue for this: #21

ChrisBarker-NOAA commented 8 years ago

@cbcunc : thought you'd like to be in on this conversation.

-CHB

kwilcox commented 8 years ago

Was there any discussion of whether the base "grid" object was going to be a subclass of xarray.Dataset, netCDF4.Dataset, or its own object that does its own on-disk and in-memory management?

ocefpaf commented 8 years ago

@kwilcox I believe so. If dask is available, xarray will use it, and almost everything will be lazily loaded and chunked, making it easier to load big data from disk and do operations with low memory.

However, the chunking is up to the user and the distributed computation is not available yet as far as I know.

kwilcox commented 8 years ago

OK, my vote is for xarray.Dataset, that is what I've been doing lately and it works out really well.

hetland commented 8 years ago

+1 for xarray.Dataset

It might be nice if the core dataset could also be a netCDF4.Dataset or an np.array, and this could be done by creating 'decorator' classes so that, e.g., dimensions were calculated in the same way. Or we could create a few generic functions to calculate, e.g., the grid size of whatever object was supplied. These would only be called or used internally, not exposed, so that it would work on all the common data types. This would be useful if someone develops another datatype that we all love and want to incorporate.
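A rough sketch of the "generic functions" idea, assuming duck typing on three attributes: `sizes` (a name-to-length mapping on xarray objects), `dimensions` (a name-to-Dimension mapping on a netCDF4.Dataset, where `len()` of a dimension gives its size) and `shape` (on bare arrays). The function name `dimension_sizes` is hypothetical, and the demo uses stand-in objects rather than the real libraries.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def dimension_sizes(obj):
    """Return {dimension name: length} for several kinds of data container."""
    # xarray-style: a ready-made name -> length mapping.
    sizes = getattr(obj, "sizes", None)
    if sizes is not None:
        return dict(sizes)
    # netCDF4.Dataset-style: name -> dimension object, len() gives the length.
    dims = getattr(obj, "dimensions", None)
    if isinstance(dims, Mapping):
        return {name: len(dim) for name, dim in dims.items()}
    # Bare array: only a shape is known, so invent positional names.
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return {f"dim_{i}": n for i, n in enumerate(shape)}
    raise TypeError(f"cannot determine dimensions of {type(obj).__name__}")


# Stand-ins that mimic only the attributes used above (not the real classes).
xarray_like = SimpleNamespace(sizes={"time": 4, "node": 100})
netcdf_like = SimpleNamespace(dimensions={"time": range(4), "node": range(100)})
array_like = SimpleNamespace(shape=(4, 100))

from_xarray = dimension_sizes(xarray_like)
from_netcdf = dimension_sizes(netcdf_like)
from_array = dimension_sizes(array_like)
```

Supporting a new container type then means adding one branch here rather than touching the code that uses the sizes.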

Prof. Rob Hetland Texas A&M Univ. – Dept. of Oceanography http://pong.tamu.edu/~rob

rsignell-usgs commented 8 years ago

Bringing @pelson in here also, as we discussed this a bit as well at SciPy 2016 (and at the airport before heading home...)

rsignell-usgs commented 8 years ago

BTW, @pelson is now co-leading development of Iris (Richard Hattersley left Iris to pursue other projects at the Met Office).

ChrisBarker-NOAA commented 8 years ago

On Mon, Jul 18, 2016 at 7:52 AM, Kyle Wilcox notifications@github.com wrote:

Was there any discussion if the base "grid" object

Do you mean what more-or-less maps to a variable? i.e. the object that represents a field of a particular quantity? Or the whole shebang?

was going to be a subclass of xarray.Dataset, netCDF4.Dataset, or its own object that does its own on-disk and in-memory management?

Probably not a subclass of anything. And definitely not a netCDF4.Dataset.

I'm thinking that it'll probably be its own thing, maybe with a similar API to an xarray Dataset where applicable.

Under the hood, the raw data itself needs to be in SOME kind of array object -- I'm hoping that can be any numpy array-like object:

numpy arrays, netCDF4 variables, dask arrays

(Filipe has almost convinced me that dask is lightweight and reliable enough that we may be able to require that...)

I guess I need to poke into xarray more and see how well the API maps.

I will say that netCDF4 made some "clever" API decisions that I don't really like -- like mingling variable attributes and python object attributes. It's really nifty that you can do:

my_variable.units

And get the units, but then when you have an nc_attribute that isn't a legal python identifier, or might clash with a python object attribute, it gets ugly. So I'd rather have the simpler:

my_variable.attrs['units']

I think xarray does similar stuff in that way, too ( :-( )
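To make the clash concrete, here is a toy illustration in plain Python. `MingledVariable` and `ExplicitVariable` are made-up classes that only imitate the two access styles; they are not the netCDF4 or xarray implementations, and the attribute names in `file_attrs` are hypothetical.

```python
class MingledVariable:
    """File attributes are exposed as Python attributes (the "clever" style)."""

    def __init__(self, attrs):
        self._attrs = dict(attrs)
        self.shape = (10, 20)  # an ordinary Python attribute of the object

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._attrs[name]
        except KeyError:
            raise AttributeError(name) from None


class ExplicitVariable:
    """File attributes live in their own dict (the "simpler" style)."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.shape = (10, 20)


file_attrs = {
    "units": "psu",
    "shape": "a file attribute that happens to be called shape",
    "valid-range": (0, 40),  # not a legal Python identifier
}

mingled = MingledVariable(file_attrs)
explicit = ExplicitVariable(file_attrs)

units = mingled.units                        # nifty when it works
shadowed = mingled.shape                     # the Python attribute wins
awkward = getattr(mingled, "valid-range")    # `mingled.valid-range` would not parse
unambiguous = explicit.attrs["shape"]        # always the file attribute
```

With the explicit dict, every file attribute is reachable the same way, whatever its name.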

The point: it would be nice to have the freedom to define our own API, though I do see the point of matching existing APIs for usability's sake, too.

ChrisBarker-NOAA commented 8 years ago

On Mon, Jul 18, 2016 at 8:22 AM, Rob Hetland notifications@github.com wrote:

+1 for xarray.Dataset

It might be nice if the core dataset could also be a netCDF4.Dataset or an np.array, and this could be done by creating 'decorator' classes so that, e.g., dimensions were calculated in the same way.

Hmm -- I think it's not going to work to have the gridfu.Dataset be anything else -- i.e. a subclass. But anyone is free to hack together some prototypes and see how it works.

I was hoping to do that at the SciPy sprints, but ended up dealing with updating tests in pysgrid and dealing with ugly segfaults in netcdf instead :-(

Christopher Barker, Ph.D. Oceanographer

Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov

jsignell commented 8 years ago

My dad pulled me in. @ChrisBarker-NOAA you can call units either way just like in pandas. These are equivalent: my_variable.attrs['units'] == my_variable.units

@ocefpaf In terms of lazy computations there is some functionality in xarray.ufuncs. I put up a little gist here: https://gist.github.com/jsignell/102c343a361a80725c83cd1db6e834b6

ChrisBarker-NOAA commented 8 years ago

Thanks @jsignell -- though I'd probably disable the direct attribute access if I could:

"There should be one-- and preferably only one --obvious way to do it."

But anyway, that's just one example -- the question at hand is, do we want to be locked into an API that's already been defined, or have flexibility to make it better / more suited to the problem at hand?

jay-hennen commented 8 years ago

I don't yet have a good use case for this API that I can think with. GNOME's environment.grid_property.GridProp objects are probably the best example of one of these 'higher level' objects I can think of.

To explain: a GridProp object has several key attributes:

  1. A grid object (pysgrid or pyugrid)
  2. A time object (gnome.environment.property.Time)
  3. One or more variable/data objects (netCDF4.Variable or numpy.array or other array-like)

The core use of these objects is to answer 'What is the value of the variable(s) at p points, at time t?'. The GridProp object uses the py*grid object's interpolate_var_to_points to get 2D interpolated values for t1, repeats for t2, and then applies the time interpolation weights from the Time object and returns the interpolated results.
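The two-step scheme described above could be sketched roughly as follows. This is not GNOME's actual code: `time_weights` and `interpolate_in_time` are hypothetical names, linear interpolation in time is an assumption, and `node_values` is a trivial stand-in for the spatial step (the real interpolate_var_to_points signature is not shown in this thread).

```python
from bisect import bisect_right


def time_weights(times, t):
    """Return ((i1, i2), (w1, w2)) for the two time steps bracketing t.

    `times` must be sorted and hold at least two entries.
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("time is outside the range covered by the data")
    i2 = min(max(bisect_right(times, t), 1), len(times) - 1)
    i1 = i2 - 1
    w2 = (t - times[i1]) / (times[i2] - times[i1])
    return (i1, i2), (1.0 - w2, w2)


def interpolate_in_time(spatial_interp, times, points, t):
    """Interpolate in space at t1 and t2, then blend with the time weights."""
    (i1, i2), (w1, w2) = time_weights(times, t)
    v1 = spatial_interp(i1, points)   # 2D interpolation at time step t1
    v2 = spatial_interp(i2, points)   # ... and again at t2
    return [w1 * a + w2 * b for a, b in zip(v1, v2)]


# Toy data: snapshots[time_index][node]; "points" here are just node indices.
times = [0.0, 3600.0, 7200.0]
snapshots = [[10.0, 20.0], [14.0, 28.0], [18.0, 36.0]]


def node_values(time_index, points):
    """Stand-in for the grid's spatial interpolation."""
    return [snapshots[time_index][p] for p in points]


values = interpolate_in_time(node_values, times, [0, 1], 900.0)
```

Keeping the time weighting separate from the spatial step is what lets the same Time logic serve both pysgrid and pyugrid objects.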

This is a nice and narrow use case, but is also applicable to a lot of problems. I get the impression that this is not the sort of thing we are talking about here, however.

The way I see it, netCDF4.Dataset or xarray.Dataset already provide holistic data representation/provision service. Is py_anygrid going to have the same role, but with some extra grid-specific functionality tacked on?

Sorry for the wandering in this post, but it accurately reflects my ignorance about the purpose of something like py_anygrid

ocefpaf commented 8 years ago

My dad pulled me in. @ChrisBarker-NOAA you can call units either way just like in pandas. These are equivalent: my_variable.attrs['units'] == my_variable.units

Yep. That is the way to go IMO, to be friendly to those who want to explore with tab completion or write "clear" code.

@ocefpaf In terms of lazy computations there is some functionality in xarray.ufuncs. I put up a little gist here: https://gist.github.com/jsignell/102c343a361a80725c83cd1db6e834b6

Nice example! I am not sure about distributed computation though.

ChrisBarker-NOAA commented 8 years ago

On Mon, Jul 18, 2016 at 10:56 AM, jay-hennen notifications@github.com wrote:

GNOME's environment.grid_property.GridProp objects are probably the best example of one of these 'higher level' objects I can think of.

To explain: a GridProp object has several key attributes:

  1. A grid object (pysgrid or pyugrid)

    which would not be part of the "public" API -- at least not for common use.

  2. A time object (gnome.environment.property.Time)
  3. One or more variable/data objects (netCDF4.Variable or numpy.array or other array-like)

The core use of these objects is to answer 'What is the value of the variable(s) at p points, at time t?'. The GridProp object uses the py*grid object's interpolate_var_to_points to get 2D interpolated values for t1, repeats for t2, and then applies the time interpolation weights from the Time object and returns the interpolated results.

This is a nice and narrow use case, but is also applicable to a lot of problems. I get the impression that this is not the sort of thing we are talking about here, however.

No -- it is -- the goal is to move the time (and depth) objects into the DataSet object (and/or whatever the variable object is) so that we have a single API where you can query for interpolated values in (lon, lat, depth, time) coordinates.

The way I see it, netCDF4.Dataset or xarray.Dataset already provide holistic data representation/provision service. Is py_anygrid going to have the same role, but with some extra grid-specific functionality tacked on?

That's pretty much the idea -- though I'm hoping it's "integrated", rather than "tacked on".

-CHB
