ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)
Creative Commons Zero v1.0 Universal
2 stars 1 forks source link

What is the goal of all this anyway? #20

Open ChrisBarker-NOAA opened 8 years ago

ChrisBarker-NOAA commented 8 years ago

@ocefpaf, @rsignell-usgs, @hetland, @ayan-usgs, @kwilcox, @rhattersley, @jay-hennen, @acrosby, @pelson: trying to get everyone here.

Putting this issue in APIRUS, because the point of APIRUS was to get pysgrid and pyugrid to share an API and be interoperable, so this question is important to both projects. And I've gotten very frustrated with the lack of consensus about what we are trying to do with these projects.

On the sgrid project, there have been discussions about what code "belongs" in pysgrid (and thus pyugrid)

Key to this is whether these tools are just "a spec parser".

I don't think that pysgrid nor pyugrid should be thought of as a spec parser - that's actually a secondary feature. I started pyugrid not as a spec parser, but as object that embodies the DATA MODEL encoded in the UGRID spec, AND provides tools for manipulating that data model -- NOT a parser for the file format.

The file format parser was actually added later (though not much later...) and indeed, it really should be separated more -- i.e. a compliance checker requires parsing the spec, but not creating any data objects at all.

Working with unstructured grids and staggered grids is a pain, and hard to do efficiently, and you have to do it differently with each of them -- if we want to provide a SINGLE API that allows us to work with this gridded data, then we need to provide these spatial indexing and interpolation tools somewhere.

One example: cell location:

An argument has been made that, for instance, sciwms uses rtree to do cell location, so we don't need to add cell location to these tools. But I'd argue that sciwms added its own cell location code because neither pyugrid nor pysgrid had it -- they HAD to implement it themselves. But since many people that work with these grids need that feature, why make everyone re-invent it themselves?

Same for interpolation. Note that providing these features in no way means that people HAVE to use that particular method -- of course, there will always have access to the underlying data, so can do whatever they want with it.

I've pretty much given up on iris integration, but if that ever does happen, then presumably iris will need the grid navigation, etc. code, too -- why make they write it themselves?

I think I've made the case that this code needs to be somewhere, so the question is where.

If you all want pyugrid and pysgrid to be "spec parsing" tools, then fine -- we'll fork and put our code in another project, and we'll probably use pyugrid and pysgrid do do the parsing. But I think that's a mistake -- we're trying to contribute this code because it's useful to others, and because we're hoping others will maybe contribute parts we haven't written yet, or, at the very least, provide additional testing and debugging. So it's seems like a bad idea to put this code embedded in our tools.

So that leaves creating a pile of other projects. Sure, I like the "unix philosophy" of small tools that each do one thing well, and putting them together to build more complex systems, but there is a lot of overhead to maintaining individual packages, both for the developer and the users. And teh really successful packages are successful (arguably) because they DO include the kitchen sink:

matplotlib has a LOT in it. do you think it would have any anything like its success, if for instance, users had this workflow:

install MPL.

do some nifty plotting

now I need to do a contour plot!

Hmm, now I need a contouring package.

Yeah! nice plot!

That seems to be exactly what's being proposed here:

Install pysgrid Go to plot some stuff

Contrast this with:

install pysgrid go to plot some stuff

Now I need to get some values interpolated to particular points

Now I need the velocities at an arbitrary point

(I sure have gotten a lot done without having to know about the details of sgrids!)

OK, so we could accomplish much of this with a set of inter-dependent packages, and then the infrastructure in conda and pip to make it easy to install, and the documentation that makes it clear what package(s) you want to do what. But why? This grid navigation and interpolation code needs an intimate knowledge of the data structure in the *grid objects -- they really aren't going to be useful for anything else, so why incur the overhead of multiple separate packages rather than putting it all together?

I guess it comes down to this: adding functionality to a package has only an upside: adds features, and people that don't need those features can ignore it.

The reason for separate packages for a feature is if that package might be useful on its own, for other purposes, without the core package. That's why we made cell_tree2d a separate package. Though even in that case, if we were to bring it into APIRUS or something, it would make dependency management easier

Wow -- that was a long rant!

ayan-usgs commented 8 years ago

An argument has been made that, for instance, sciwms uses rtree to do cell location, so we don't need to add cell location to these tools. But I'd argue that sciwms added its own cell location code because neither pyugrid nor pysgrid had it -- they HAD to implement it themselves. But since many people that work with these grids need that feature, why make everyone re-invent it themselves?

There is precedent in pysgrid for having functionality beyond a spec parser. For instance, the processing_2d module contains a number of functions used by sci-wms to create the WMS responses; this functions also had their analogs in iPython notebooks that @rsignell-usgs shared with me. It seemed that there were some common tasks that needed to be done with respect to staggered datasets, so it made sense to stick them in a module in pysgrid.

I don't have any major philosophical heartburn with adding some common utilities as long as they are distinct from the code that brings in the data and parses the convention(s), and the utilities are consistent with the convention. The consistency issue alludes to an older PR about whether their are multiple grids within a staggered grid, or if it's one grid with components. While that issue has been resolved (I think it has anyway...), it was an example of useful code that wasn't consistent with the convention at the time.

rsignell-usgs commented 8 years ago

I think we need cell finding and interpolation to be within pysgrid, pyugrid, etc because that will allow these packages to have the common functionality that allow the duck typing we want APIRUS to have.

In other words, if we want APIRUS to be able to return a vertical section along an arbitrary x,y path, regardless of model type, then pycf (iris?), pugrid, pysgrid, etc, will need to each have cell location and interpolation capabilities that will return the requested section in a common way.

So I'm now in the camp that each of these libraries needs a fair amount of functionality. Which is okay because the cell finding and interpolation routines needed for different types of grids is often different as well.

rsignell-usgs commented 8 years ago

If @kwilcox is on board with this, then I think it's time to merge https://github.com/sgrid/pysgrid/pull/85.

kwilcox commented 8 years ago

I'm not keeping up with the pysgrid and apirus issues/comments as much as I should be!

It will be beneficial to include the most optimal "cell location" algorithm in each of the individual grid libraries. This will allow for one implementation, and it can be improved over time by the people who work the closest with the grid type. Future users of a standardized API shouldn't have to optimize their own index trees to find "closest cell", "nearest neighbour", etc. (unless they want to).

I don't know if this was discussed, but any "cell location" setup should be built lazily; when the user specifically asks for it do be done or the user calls a function that requires it. I wouldn't want to burden a user with the setup time for an index tree if they don't need it.

https://github.com/sgrid/pysgrid/pull/85 is :+1: from me, but I would have preferred some better tests, more documentation (in the form of example notebooks), and more inline code comments (so I don't have to reverse engineer every function to figure out what is happening). You have to start somewhere though, and @jay-hennen did a great job improving the usability/functionality of pysgrid.

ChrisBarker-NOAA commented 8 years ago

Thanks for all your positive comments -- let's move this forward!

@kwilcox wrote:

I would have preferred some better tests, more documentation (in the form of example notebooks), and more inline code comments

Absolutely! documentation is second to features in making your code usable by others. I think tests are actually third after those -- at least for the "getting people to use it" aspect).

@ocefpaf did an absolutely great code review of sgrid/pysgrid#85, so it'll get better.

As for example notebooks and the like -- it would be great if others could contribute some of those, we only know what we need it for. Maybe in the form of "here's what I'd like to do", and we'll fill in the how to do it part.

i.e. a sample dataset, and a problem to solve with it. (i.e, plot a cross section), etc...

One of the big challenges for test and examples is data sets that aren't too huge -- but OpenDAP sources are a good start for notebooks.