Desirable Mission creep: *grid API?

ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)

Creative Commons Zero v1.0 Universal


Desirable Mission creep: *grid API? #13

Open ChrisBarker-NOAA opened 9 years ago

ChrisBarker-NOAA commented 9 years ago

In #1, @hetland wrote:

""" In theory this could also be done for, say, cartesian to curvilinear, in addition to structured to unstructured. """ which brings up a point -- this all started as a discussion about a joint API between ugrids and sgrids -- but maybe it should be expanded to *grids: regular grids, curvilinear grids that are not staggered, etc...

I think many of us were thinking of it in that way already, but we should probably be clear about it.

ChrisBarker-NOAA commented 8 years ago

Adding a bit more here:

It seems like what we probably all want is a SINGLE API to get data from arbitrary grids. So, in theory, one could do something like:

salinity = apirus.load_variable('salinity')

salinity.plot_transect(lon=(-17, -165), lat=(27, 28), depth=(0, -inf))

and you'd get a nifty plot.

Without needing to know what type of grid the data are on, etc.

Lots to work out for the API -- but is this the goal?

Or are we counting on IRIS, etc to provide something like this, and all we want to do is support the grid navigation parts, so they can be used?

Part of this comes down to the overall model: in UGrid now, it's not at all fleshed out, but the basic model is that the user will work with a "bunch of data on a grid". On the other hand, the Iris model is that the user works with a single data type -- e.g. salinity, and the grid info is all hidden under the hood.

I'm a bit on the fence here -- I tend to think of a whole set of associated data all on the same grid, and often want to know a bit about the grid -- also -- you certainly want all the variables you are working with to use the same grid object.

So where are we headed here?

-CHB

hetland commented 8 years ago

Yes, I think so, and I don't think this needs any extra code to handle arbitrary grids, but rather just requires similar object composition (similar to duck typing, but from the other way, see https://en.wikipedia.org/wiki/Composition_over_inheritance). In this case, salinity could be associated with a very specific grid -- load_variable would just be a wrapper (a Factory Method?) that selects which class it would be, and object composition then ensures a 'plot_transect' method.
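A minimal sketch of that idea, assuming nothing about the real pyugrid or pysgrid classes: every class name, the `grid_type` argument, and the `transect` method below are hypothetical stand-ins. The factory picks a grid class, and the variable simply holds a grid and delegates to it.

```python
class TriangularGrid:
    """Hypothetical stand-in for an unstructured (UGRID-style) grid."""

    def transect(self, values, lon, lat):
        return f"unstructured transect of {len(values)} values, lon={lon}, lat={lat}"


class CurvilinearGrid:
    """Hypothetical stand-in for a structured (SGRID-style) grid."""

    def transect(self, values, lon, lat):
        return f"structured transect of {len(values)} values, lon={lon}, lat={lat}"


class Variable:
    """A named field that holds a reference to the grid it lives on."""

    def __init__(self, name, values, grid):
        self.name = name
        self.values = values
        self.grid = grid  # composition: the variable *has a* grid

    def transect(self, lon, lat):
        # Delegate the grid-specific work to whatever grid object we hold.
        return self.grid.transect(self.values, lon, lat)


# Assumed lookup table; a real factory would inspect the file instead.
GRID_CLASSES = {"ugrid": TriangularGrid, "sgrid": CurvilinearGrid}


def load_variable(name, values, grid_type):
    """Factory: pick the grid class, then wrap the data in a Variable."""
    grid = GRID_CLASSES[grid_type]()
    return Variable(name, values, grid)


salinity = load_variable("salinity", [35.1, 35.3, 34.9], "ugrid")
print(salinity.transect(lon=(-170, -165), lat=(27, 28)))
```

The calling code never checks the grid type; any grid object with a compatible `transect` method would work.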


ChrisBarker-NOAA commented 8 years ago

Agreed - not extra code, but a consistent API. And, I think an API that is focused on the variable, rather than the grid.

though maybe not -- I guess you could still point to a source and get a grid object:

data = apirus.load_grid(some_file_or_url)

data.variables['salinity'].plot_transect(...)

or something like that.

Does sgrid have this already?

rsignell-usgs commented 8 years ago

Since we are speaking conceptually in this thread, what we are really talking about is not a new APIRUS package, but just modifying the pysgrid, pyugrid packages so we can have generic variable/grid objects that will have similar methods and attributes, right?

hetland commented 8 years ago

That would be my understanding, yes.

It is possible that a separate 'wrapper' class be created with the appropriate Factory Methods to query the file and return the appropriate grid class. But this would be a fairly thin layer, I think. The majority of the work would still take place in the *grid packages.
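One way such a thin layer might decide which grid class to return is to look at the `cf_role` attribute, which the UGRID conventions set to `mesh_topology` and the SGRID conventions set to `grid_topology`. The sketch below is only an illustration: `detect_grid_type` is a made-up name, and the attributes are passed in as a plain dict rather than read from a file.

```python
def detect_grid_type(variable_attrs):
    """Return 'ugrid', 'sgrid', or 'unknown'.

    variable_attrs maps each variable name to a dict of its attributes,
    for example as collected from the variables of a netCDF file.
    """
    for attrs in variable_attrs.values():
        role = attrs.get("cf_role")
        if role == "mesh_topology":
            return "ugrid"
        if role == "grid_topology":
            return "sgrid"
    return "unknown"


# Hypothetical attribute dump of a file holding a triangular mesh.
example = {
    "mesh": {"cf_role": "mesh_topology", "topology_dimension": 2},
    "salinity": {"units": "1", "mesh": "mesh"},
}
print(detect_grid_type(example))
```

A wrapper could map the returned string to the matching *grid class and leave everything else to that package.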


ocefpaf commented 8 years ago

The majority of the work would still take place in the *grid packages.

:+1:

ChrisBarker-NOAA commented 8 years ago

maybe more than a thin wrapper -- isn't the work @ocefpaf has been doing with the vertical slicing, etc. applicable to all?

Back in the day, we talked about one of the goals of UGRID would be to supply what Iris needed to be UGRID aware, but not duplicate functionality that IRIS provides.

Now I think maybe we are kind of re-implementing Iris -- but only the parts of it we want, which is what the "mission creep" in this issue meant. But that does mean that there is a fair bit of shared code to put somewhere.

I'm thinking that we need a "variable" object of some sort, for instance. Maybe it's just an API that all the *grids support, but I imagine there is some code to be shared.

hetland commented 8 years ago

I think the majority of the code dealing with vertical slicing, etc. would still live in the respective *grid classes, and would be called as methods on variables. This could also have a shallow wrapper around it, if needed.


ocefpaf commented 8 years ago

isn't the work @ocefpaf has been doing with the vertical slicing, etc. applicable to all?

Yes.

Now I think maybe we are kind of re-implementing Iris

I don't think so. We are trying to make the API clearer and more generic so iris and any other library can use both pyugrid and pysgrid in a consistent way.

but only the parts of it we want.

Which are?

Take the vertical slicing (ciso), for example, which does not exist in iris. You can slice a regular cube (non-ugrid and non-curvilinear/sgrid) and interpolate a level. But that is it! In fact, iris is lacking a lot of functionality when it comes to the most commonly used ocean grids.

I admit that odvc shares code with iris, but that is not supposed to be a library, it is just part of the exercise to gather the tools we need. Other libraries will need to build the vertical coordinate and having iris as a dependency is not an option. Ideally, if odvc does become a library, iris can use it too. Maybe the logic in odvc will end up in both pysgrid and pyugrid, maybe not. We'll see once we get there.

I'm thinking that we need a "variable" object of some sort, for instance. Maybe it's just an API that all the *grids support, but I imagine there is some code to be shared.

Sharing code is a possibility, but we still need to work hard on the base to get there.

With all that said, we need to start assigning tasks. I will take pysgrid load_grid() and the migration from pair_arrays to lon, lat. (I am counting on @ayan-usgs for some assistance :wink: )

ChrisBarker-NOAA commented 8 years ago

On Thu, Oct 29, 2015 at 1:42 PM, Rob Hetland notifications@github.com wrote:

I think that majority of the code dealing with vertical slicing, etc would still live in the respective *grid classes, and would be called as methods to variables.

Isn't the vertical stuff the same? better to share code, yes?

Also time would be the same.


ChrisBarker-NOAA commented 8 years ago

Now I think maybe we are kind of re-implementing Iris

I don't think so. We are trying to make the API clearer and more generic so iris and any other library can use both pyugrid and pysgrid in a consistent way.

OK -- good goal there but plotting? and ???

but only the parts of it we want.

Which are?

honestly, I have no idea -- still haven't really used iris. my primary use cases are more get values at particular places -- ultimately to drive a particle tracker -- not really Iris at all.

Take the vertical slicing (ciso), for example, that does not exist in iris. You can slice a regular cube (non-ugrid and non-curvilinear/sgrid) and interpolate a level. But that is it!

Ah -- I thought Iris did handle the vertical OK.

so good stuff.

ChrisBarker-NOAA commented 8 years ago

Sharing code is a possibility, but we still need to work hard on the base to get there.

yup -- I'm a fan of writing the code to do stuff, and see what API falls out.

ChrisBarker-NOAA commented 8 years ago

OK -- time to wake this thread up again:

@ChrisBarker-NOAA wrote:

I'm a fan of writing the code to do stuff, and see what API falls out.

So -- we've been doing that. This PR in py_ugrid: (actually written by @jay-hennen on the GNOME team)

https://github.com/pyugrid/pyugrid/pull/111

implements linear interpolation on a triangular mesh grid. And the bigger piece is that to do that you need to do the whole "what cell am I in" thing, so we've hooked it up with a CellTree implementation:

https://github.com/NOAA-ORR-ERD/cell_tree2d

to do that part (an optional dependency at this point, but if you work with grids of more than a dozen cells, you really need a good spatial index to search them).

But now we want to really use the thing. In this case, we want to make it usable in a particle tracking model, but that should require similar features as other use cases. So my thoughts on the API now are:

The interpolation, etc. goes in UGrid -- that is specific to the grid type, etc.

For the most part we want to work with an object that represents the variable itself -- an abstraction of a field of a value -- in this case the py_ugrid UVar object. So that object needs a reference to the grid that the data are on, so it can do interpolation, etc.

All good, but a couple issues:

1) for the use case at present, we need velocity -- a vector quantity that is stored in two variables in the netcdf file, and as two variables in the default UVar implementation -- but we really want to be able to work with it as a single thing, not least because it would be unfortunate to have to do the whole find-the-cell, compute-the-interpolation-weights routine twice. So we're building a vector_var object of some sort. [now that I think about it, maybe there should be a generic multi-var API -- it's not uncommon to want two variables at the same time and place, maybe temp and salinity, etc...]

2) in this case, you kind of want to work with a given variable, and the grid underneath is more or less an implementation detail. But often you want to work with multiple variables, at once (see above), and sometimes with the whole darn dataset at once. After all we put all this stuff into one netcdf file for a reason -- a bunch of stuff all related, on the same grid, etc....
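The point in 1) about not repeating the cell search and weight computation can be sketched as follows. This is an illustration only, not the code in the pyugrid PR: the function names are made up, the cell search is brute force (a spatial index such as cell_tree2d would replace `locate` on real grids), and points exactly on an edge get no tolerance handling.

```python
def barycentric_weights(tri, point):
    """Weights of `point` with respect to the three vertices of `tri`."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    x, y = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def locate(nodes, faces, point):
    """Brute-force search for the face containing `point`."""
    for face_index, face in enumerate(faces):
        weights = barycentric_weights([nodes[n] for n in face], point)
        if all(w >= 0.0 for w in weights):
            return face_index, weights
    return None, None


def interpolate_many(nodes, faces, point, *fields):
    """Locate the cell once, then reuse the weights for every field."""
    face_index, weights = locate(nodes, faces, point)
    if face_index is None:
        return None  # point is outside the grid
    face = faces[face_index]
    return tuple(
        sum(w * field[n] for w, n in zip(weights, face)) for field in fields
    )


# Two triangles covering the unit square, with u and v stored per node.
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
faces = [(0, 1, 2), (1, 3, 2)]
u = [1.0, 2.0, 3.0, 4.0]
v = [0.0, 0.0, 1.0, 1.0]
print(interpolate_many(nodes, faces, (0.25, 0.25), u, v))
```

The same call works for any set of node-based variables on the grid, which is the generic multi-var case mentioned in the bracketed aside.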

My first vision was that a UGrid object WAS the grid, AND all the data associated with it, hence the UGrid.data attribute. This might be a fine way to go -- you load up a netcdf file, then you grab the UVar you want and work with that. We need to be careful about circular references, but other than that, why not? But the goal of APIRUS is to have a unified API regardless of grid type. So it may make more sense to have a generic container object for a grid and the data associated with it. We could just keep the APIs of UGrid, sgrid, and whatever else the same, but I suspect there is a non-trivial amount of code that would be identical, so it is better to subclass them all from the same object. That is a better way to make sure the APIs match, too.

While I'm at it, the variable object should perhaps be shared as well -- dealing with time and the vertical will be essentially the same, so why duplicate that code?

So that leaves us with:

1) Some kind of variable object: UVar and ??? As far as I can tell, in its current state, you work with raw netcdf variables in sgrid.

2) The grid object: UGrid and SGrid2D -- these should share an API but probably be duck-typed (or ABC?), though they could share some utility code for dealing with netcdf files, etc.

3) A container_of_variables object -- this would essentially map to what is usually in a netcdf file: the grid info and the variables. This is what you'd load:

ContainerOfVariables.from_netcdf(filename_or_url)

(OK -- we need a better name for this....)

This could handle keeping multiple variables that are on the same grid in sync, etc, etc. Kind of like a cube list, but with more functionality.

Note: I'm thinking that code for dealing with time and maybe the vertical would be in this container. Or maybe a Time object and a VerticalCoordinate object, similar to, but simpler than, the UGrid and SGrid objects. (Maybe time is simple enough to just build into the variable...)
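A rough sketch of the container in 3), under the assumption that its main job is to guarantee that every variable shares one grid object. `GridDataset`, `Variable`, and `add_variable` are placeholder names, and the grid is just a plain object here.

```python
class Variable:
    """Hypothetical grid-aware variable: data plus a reference to its grid."""

    def __init__(self, name, data, grid):
        self.name = name
        self.data = data
        self.grid = grid


class GridDataset:
    """Hypothetical container: one grid plus the variables defined on it."""

    def __init__(self, grid):
        self.grid = grid
        self.variables = {}

    def add_variable(self, name, data):
        # Every variable is built with the container's own grid object,
        # so all variables are guaranteed to share it.
        var = Variable(name, data, self.grid)
        self.variables[name] = var
        return var


grid = object()  # stand-in for a UGrid or SGrid2D instance
dataset = GridDataset(grid)
dataset.add_variable("salinity", [35.1, 35.3])
dataset.add_variable("temp", [12.0, 11.5])
print(dataset.variables["salinity"].grid is dataset.variables["temp"].grid)
```

In this layout the variables point at the grid rather than back at the container, which sidesteps the circular-reference concern raised above.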

In short -- I'm thinking we need to stop with the separate projects, and start the unified *grid project!

@rsignell-usgs, @ocefpaf, @hetland : what do you think? We're trying to get operational code going here, so we need to nail this down soon, or we'll be off and running with something only we want to use. And I'd really like someone else to write the code for sgrid and the vertical coordinate for us -- so I don't want that :-)

ocefpaf commented 8 years ago

https://github.com/NOAA-ORR-ERD/cell_tree2d

I added a recipe for that (https://github.com/ioos/conda-recipes/pull/632) and we can try it out as soon as https://github.com/pyugrid/pyugrid/pull/111 gets merged.

But the goal of APIRUS is to have a unified API regardless of grid type. So it may make more sense to have a generic container object for a grid and the data associated with it. We could just keep the APIs of UGrid, sgrid, and whatever else the same, but I suspect there is a non-trivial amount of code that would be identical, so it is better to subclass them all from the same object. That is a better way to make sure the APIs match, too.

That's the dream :wink:

While I'm at it, the variable object should perhaps be shared as well -- dealing with time and the vertical will be essentially the same, so why duplicate that code?

Or why write that code? Maybe we cannot avoid it when it is grid-specific stuff...

1) Some kind of variable object: UVar and ??? As far as I can tell, in its current state, you work with raw netcdf variables in sgrid.

2) The grid object: UGrid and SGrid2D -- these should share an API but probably be duck-typed (or ABC?), though they could share some utility code for dealing with netcdf files, etc.

3) A container_of_variables object -- this would essentially map to what is usually in a netcdf file: the grid info and the variables. This is what you'd load:

This could handle keeping multiple variables that are on the same grid in sync, etc, etc. Kind of like a cube list, but with more functionality.

When I think about that problem I end up with the same conclusion as you. I do not like the cube list, but that is the most sensible way to tackle this problem so far.

In short -- I'm thinking we need to stop with the separate projects, and start the unified *grid project! @rsignell-usgs, @ocefpaf, @hetland : what do you think? We're trying to get operational code going here, so we need to nail this down soon, or we'll be off and running with something only we want to use. And I'd really like someone else to write the code for sgrid and the vertical coordinate for us -- so I don't want that :-)

Not sure if unifying them is the way to go. If we keep close interaction between pyugrid and pysgrid development we can keep the projects separated while maintaining their core as generic as possible.

That will allow "extra stuff" to be developed independently on each end. Nothing prevents a third project from using this "core grid stuff", ignoring these extras, and unifying them into one module. That way we do not block the development of pyugrid and pysgrid.

However, I think that the only use case for both pyugrid and pysgrid out there is @kwilcox sci-wms. So maybe it is time for him to say something about this :stuck_out_tongue_winking_eye:

PS: Vertical coordinate is "done." We just need to find a way to integrate into what we have.

rsignell-usgs commented 8 years ago

@kwilcox , we could really use your input here -- I think you were the one that originally called for more coordination and standardization of the SGRID and UGRID packages.

Do you think they should stay as coordinated separate packages, or be combined into one uber-grid package?

kwilcox commented 8 years ago

I'm in favor of separate targeted repositories that have a small module that implements the common API we are after. This will allow people to focus on their own code, and add functionality to their own codebase without worrying about a larger repository. As long as the apirus module works, they are free to add other functions not related to the common API.

ChrisBarker-NOAA commented 8 years ago


OK, but that means there is an apirus module -- what the heck goes in there?

As we try to extend py_ugrid to do what we need, we find ourselves writing/needing code that is not in the least UGrid specific:

code for handling interpolation in time
code for handling vertical coordinates
some sort of "Variable" object that will be grid-aware, but hopefully not have to know what kind of grid.
some sort of "collection of variables" object that handles the interaction between the grid and the variables stored on that grid.

Does all this go in the apirus package? Then we keep the unstructured grid-specific stuff in its own package/repo?
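To illustrate why the first item in that list is not grid specific, here is a minimal sketch of linear interpolation in time. It assumes each time step is stored as a flat list of node or cell values; `interpolate_in_time` is a made-up name, and nothing in it needs to know what kind of grid the values sit on.

```python
from bisect import bisect_right
from datetime import datetime


def interpolate_in_time(times, snapshots, when):
    """Linearly interpolate between the two snapshots that bracket `when`.

    times     -- sorted list of datetime objects, one per snapshot
    snapshots -- list of equally long value lists, one per time step
    """
    if not times[0] <= when <= times[-1]:
        raise ValueError("requested time is outside the data's time range")
    i = bisect_right(times, when) - 1
    if i == len(times) - 1:  # exactly on the last time step
        return list(snapshots[i])
    t0, t1 = times[i], times[i + 1]
    alpha = (when - t0) / (t1 - t0)  # timedelta / timedelta gives a float
    return [(1 - alpha) * a + alpha * b
            for a, b in zip(snapshots[i], snapshots[i + 1])]


times = [datetime(2015, 12, 17, 0), datetime(2015, 12, 17, 6)]
snapshots = [[10.0, 20.0], [16.0, 26.0]]
print(interpolate_in_time(times, snapshots, datetime(2015, 12, 17, 3)))
```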

And while I appreciate the flexibility of separate projects, if we really want this all to work together, it seems it would be a lot easier to do that if it was all in one repo.

And yes -- what I'm moving towards is essentially re-writing IRIS :-)

-CHB


kwilcox commented 8 years ago

apirus will contain base ABCs for plugins to implement, plus very high level classes like Dataset that provide the nice, clean and coherent interface we are looking for. We can base this on stevedore to remove the complications of dealing with plugin discovery and loading. Maybe I'm a dreamer, but this seems sweet to me.

from apirus import Dataset
with Dataset.open('mygrid.nc') as ugrid:
    print("Implementation: ", ugrid.__impl__)  # User shouldn't care
    print(ugrid.functions)
    print(ugrid.subset(bbox=..., variables=...))

Implementation: pyugrid
[
    subset,
    regrid,
    transect,
    iso_surface
]
apirus.Dataset: {
    '__impl__': 'pyugrid',
    variables: [...],
    ... 
}

from apirus import Dataset
with Dataset.open('myglider.nc') as glider:
    print("Implementation: ", glider.__impl__)  # User shouldn't care
    print(glider.functions)
    glider.isosurface(...)

Implementation: trajectory
[
    subset
]
Traceback (most recent call last):
   ...
NotImplementedError: isosurface has not been implemented for trajectory datasets

# apirus/dataset.py
import six
import abc

@six.add_metaclass(abc.ABCMeta)
class Dataset(object):
    def __init__(self, path):
        self.path = path

    @abc.abstractmethod
    def iso_surface(self, z, time=None, geoid=None):
        """ Compute an isosurface

        :param time: A time as a python datetime.datetime or a np.datetime64 to compute the isosurface at. If no time is provided, an isosurface at every time in the Dataset is computed.
        :param z: A value (in meters positive down) to compute the surface at
        :returns: Dataset object (x, y) object of the isosurface if time is specified. Dataset object (time, x, y) if time is not specified.
        """

# pyugrid/apirus/dataset.py
import netCDF4

from apirus.dataset import Dataset
import some_iso_calculating_module  # placeholder for whatever module does the computation

class UgridDataset(Dataset):
    def iso_surface(self, z, time=None, geoid=None):
        """ Compute an isosurface

        :param time: A time as a python datetime.datetime or a np.datetime64 to compute the isosurface at. If no time is provided, an isosurface at every time in the Dataset is computed.
        :param z: A value (in meters positive down) to compute the surface at
        :returns: UgridDataset object (x, y) object of the isosurface if time is specified. UgridDataset object (time, x, y) if time is not specified.
        """
        # netCDF4 files are opened with netCDF4.Dataset, which also works as a context manager
        with netCDF4.Dataset(self.path) as nc:
            iso = some_iso_calculating_module.compute(nc, z, time=time, geoid=geoid)
            ... # Create new UgridDataset object to return
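One detail worth noting about the sketch above: with `@abc.abstractmethod`, a subclass that leaves `iso_surface` unimplemented cannot be instantiated at all, so the glider example could not reach the `NotImplementedError` shown. A possible alternative, purely as an illustration with made-up names, is a plain base class whose defaults raise `NotImplementedError` and whose `functions` property lists only the methods a subclass overrides.

```python
class Dataset:
    """Hypothetical base class, not an existing apirus API."""

    kind = "generic"
    API = ("subset", "regrid", "transect", "iso_surface")

    def _unsupported(self, name):
        raise NotImplementedError(
            f"{name} has not been implemented for {self.kind} datasets"
        )

    def subset(self, **kwargs):
        self._unsupported("subset")

    def regrid(self, **kwargs):
        self._unsupported("regrid")

    def transect(self, **kwargs):
        self._unsupported("transect")

    def iso_surface(self, z, time=None, geoid=None):
        self._unsupported("iso_surface")

    @property
    def functions(self):
        """Names of the API methods this subclass actually overrides."""
        cls = type(self)
        return [n for n in self.API if getattr(cls, n) is not getattr(Dataset, n)]


class TrajectoryDataset(Dataset):
    kind = "trajectory"

    def subset(self, **kwargs):
        return f"subset of a trajectory with {sorted(kwargs)}"


glider = TrajectoryDataset()
print(glider.functions)
try:
    glider.iso_surface(z=10)
except NotImplementedError as exc:
    print(exc)
```

The trade-off is that missing methods are only discovered at call time, whereas the ABC approach rejects incomplete subclasses up front.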

ChrisBarker-NOAA commented 8 years ago

OK -- now we are getting somewhere!

from apirus import Dataset

OK, so an apirus.Dataset is an object that holds the grid and the variables, etc on that grid. Essentially this generally maps to a netcdf file (or OPeNDAP URL). Good. I like that.

""" We can base this on stevedore to remove the complications of dealing with plugin discovery and loading """ Personally, while the idea of plugins is cool, I don't really care for that level of automation and they are generally a pain (maybe stevedore relieves that). I'd be just as happy to simply add each new thing as an optional import to APIRUS -- in theory, someone could go write a new plugin and not have to touch apirus code, but really, how many are we going to have?

class Dataset(object):
    def __init__(self, path):
        self.path = path

    @abc.abstractmethod
    def iso_surface(self, z, time=None, geoid=None):

I'm not sure that apirus.dataset should be an ABC -- there is an awful lot of code that could be / should be shared between the various *grid classes. That's the direction pyugrid has been going: a UGrid is the grid, and the data associated with that grid ( UVars ) -- but keeping that all in the grid class seems like a way to duplicate a lot of effort.

So I envision the apirus.Dataset has something like:

dataset.variables

dataset.grid

and maybe: dataset.time_axes

and dataset.thing_to_deal_with_vertical_coords

maybe ABCs for the Grid objects and the Variable objects (though we may be able to have a single variable class, if the *grid API is well defined).
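A small sketch of what "an ABC for the grid, a single variable class" could look like. Everything here is hypothetical: the `Grid` interface has one assumed method, `interpolate`, and `NearestNodeGrid` is a toy implementation used only to exercise it.

```python
from abc import ABC, abstractmethod


class Grid(ABC):
    """Hypothetical minimal grid interface; the real *grid API is undecided."""

    @abstractmethod
    def interpolate(self, values, lon, lat):
        """Return the value of the field `values` at (lon, lat)."""


class NearestNodeGrid(Grid):
    """Toy grid: returns the value at the closest node."""

    def __init__(self, node_lon, node_lat):
        self.node_lon = node_lon
        self.node_lat = node_lat

    def interpolate(self, values, lon, lat):
        distances = [(x - lon) ** 2 + (y - lat) ** 2
                     for x, y in zip(self.node_lon, self.node_lat)]
        return values[distances.index(min(distances))]


class Variable:
    """Single variable class that works with any Grid subclass."""

    def __init__(self, name, values, grid):
        if not isinstance(grid, Grid):
            raise TypeError("grid must implement the Grid interface")
        self.name = name
        self.values = values
        self.grid = grid

    def at(self, lon, lat):
        # All grid-specific work happens behind the Grid interface.
        return self.grid.interpolate(self.values, lon, lat)


grid = NearestNodeGrid([-170.0, -165.0], [27.0, 28.0])
salinity = Variable("salinity", [35.1, 34.8], grid)
print(salinity.at(-166.0, 27.9))
```

If the grid interface stays this small, the variable class needs no per-grid subclasses.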

with Dataset.open('myglider.nc') as glider:
    print("Implementation: ", glider.__impl__)  # User shouldn't care
    print(glider.functions)
    glider.isosurface(...)

Implementation: trajectory

I'm not so sure about glider trajectories -- is there enough to share to bother?

I see an apirus.Dataset as an abstraction for a field of variables -- a bunch of things, each a function of t,z,lon, lat (t and z optional). I'm hoping to abstract out what kind of grid the data are stored on, but still keep the basic idea that it's a bunch of stuff that exists for a range of time and space --not sure glider trajectories fit that model.

-CHB

@jay-hennen: just want to make sure you're seeing this...

ocefpaf commented 8 years ago

@kwilcox I like your dream :stuck_out_tongue_winking_eye:

jay-hennen commented 8 years ago

@jay-hennen: just want to make sure you're seeing this...

Yup, I've been trying to follow along/get up to speed. I've been working with pyugrid in a very limited setting so far, so the bigger picture is still very much hidden in the mist.
