pyugrid / pyugrid

Handle UGRID-convention NetCDF Data Model.
https://pyugrid.github.io/pyugrid
Other
35 stars 35 forks source link

Add an optional dependency list #82

Closed ocefpaf closed 7 years ago

ocefpaf commented 9 years ago

These are not installed by default right now and we have to document is:

ocefpaf commented 8 years ago

Right now wxPython is the only optional dependency. PRs #111 and #98 might add a few.

That is a staggering amount of dependencies and some of those are hard to install. I would like to start a discussion to make those PRs external modules that uses pyugrid rather than part of pyugrid.

Ping: @cbcunc, @ChrisBarker-NOAA, @rsignell-usgs, @kwilcox (the only guy that uses pyugrid in production :wink:) and @rhattersley. Also the authors of the PRs @jay-hennen and @bekozi.

ocefpaf commented 8 years ago

Bonus points for those who read the blog and watch the video :smiling_imp:

http://matthewrocklin.com/blog/work/2013/07/02/Packages

ChrisBarker-NOAA commented 8 years ago

I read the blog post -- good points. But note that he also addresses the absurdity of taking it too far.

In this case: there is substantial overhead in maintaining 5 or six separate packages. in practice, py_ugrid has been around for a couple years, and seen very little development, until the recent flurry. So I don't think making it HARDER to contribute and maintain is a good idea.

And a lot of this stuff (all?) is very tightly tied to py_ugrid.

I really wish the pip and conda had a better system for optional dependencies, but they don't -- maybe they will grow one some day. But for now, I think we can still have optional dependencies informally:

We could also crate a conda meta-package (and maybe pip?) that is nothing but the dependencies. So a user that want py_ugrid essentially only for file parsing would conda install py_ugrid, and someone that wanted the whole enchilada could conda install py_ugrid-sumo (or whatever).

ocefpaf commented 8 years ago

To be completely honest since conda I don't worry too much about dependencies. Take iris and cartopy for example, they were a nightmare to install a few years ago before conda packages.

I am more worried about the scope of the library. I see pyugrid and pysgrid like a thin grid parser that knows about the grid and expose that information to other libraries. Maybe that :ship: has :boat: and that is no longer possible (or desirable). Still, we can aim to keep both libraries simple and add extra functionality via external packages.

One example is the cell_tree stuff. Kyle and I already have similar functionality using two different approaches. I use KDTrees (and I don't even use the pyugrid built-in because I convert the grid to Cartesian coordinates first) and Kyle uses RTrees (because he is worried about serialization I believe).

So why cell_tree cannot be something that takes an ugrid object and work from there? (Warning! I am a big fan of that functional approach, so I am biased here!)

ChrisBarker-NOAA commented 8 years ago

Well, CellTree is a utility, much like rtree or kd-tree, but designed specifically for this kind of use case.

Give it a try -- I suspect it's performance will be substantially better than either of those other options -- that's why we wrote it :-)

But yes, we could have a navigable-ugrid package that used those things...

-CHB

ocefpaf commented 8 years ago

Give it a try -- I suspect it's performance will be substantially better than either of those other options -- that's why we wrote it :-)

I am looking forward to move my stuff to cell_tree! Sounds very promising.

bekozi commented 8 years ago

@ocefpaf, I am open to making the flexible mesh IO utilities an external module if that's the way pyugrid wants to go. All the extra dependencies are for reading from and writing to GIS files. The actual "flexible mesh" class does not require gdal, shapely, etc. The testing in #98 makes sure that all the optional dependencies do not interfere with operation of the trunk - so that's good anyway. However, I need to check that the "flexible mesh" class may also be imported without optionals.

ocefpaf commented 8 years ago

if that's the way pyugrid wants to go.

I guess that is the right question. Everything I said above is my opinion.

It would be nice to get a few comments from @pyugrid/owners.

ChrisBarker-NOAA commented 8 years ago

I still think we should jsut keep it all together, and have the dependencies optional -- unless there is a reason other than dependencies to split it all up. But:

I see pyugrid and pysgrid like a thin grid parser that knows about the grid and expose that information to other libraries.

I actually see it in reverse. py_ugrid, in any case, is a standard container for UGRIDS, and the logic required to work with those grids navigating the gird, interpolation, building the topology, etc. In fact, the first version of UGrid had some logic, but no reading/writing of netcdf. So if I were to break it up, I"d be tempted to keep all the logic, etc in py_ugrid, but make a separate package for netcdf reading/writing. And it it's about dependencies, netcdf is one of the ugly ones :-)

but again, I think we should keep it all together and handle the optional dependencies with meta packages, or ....

ocefpaf commented 8 years ago

So if I were to break it up, I"d be tempted to keep all the logic, etc in py_ugrid, but make a separate package for netcdf reading/writing.

If by logic you mean the UGRID conventions. Sure! The idea is tempting. I don't think it is doable though because the UGRID logic is tied to the hip to netCDF files. (Same as CF, SGRID, etc.)

and the logic required to work with those grids navigating the gird, interpolation, building the topology

  • navigating the grid: is an open statement and that is why I thinl it should be external. We can slice/index it. Anything else (whatever-Tree) is optional IMO.
  • interpolation: the same as above. What will be the scheme? What will be the re-gridding technique? If those question don't have one obvious answer and/or people might look for different solution, then it should be external too.
  • building the topology: that I agree. Building the topology is to the conventions like the conventions is to netCDF files.

Anyway, that is only my opinion. If we do the optional dependencies right, and keep the code clear, adding more stuff should not prevent people from using pyugrid as the thin grid parser I imagine.

acrosby commented 8 years ago

I think I agree with @ocefpaf , and as a general rule prefer to have many lightweight sets of tools that build on one another rather than one super complicated do-it-all module that imposes particular ways of doing things under the hood. Additionally, given the way that I deploy software to HPC resources at OWI, I often am not using or relying on Anaconda.

Take all this with a grain of salt though, since I also don't use pyugrid in production. (However I would love for it to get to the point where I do, since I end up replicated just the core bits that I need by hand all the time.)

I propose having a pyugrid(-core or something), and then have things that are built-on/rely-on for extended, or specific, functionality to be separate modules like pyugrid-shp, pyugrid-celltree, etc. with their own dependencies. I don't see a reason they couldn't all piggy back off of a common namespace though.

Interested in what @kwilcox has to say.

ocefpaf commented 8 years ago

since I end up replicated just the core bits that I need by hand all the time.

Let's work to solve that out. Can you point me to any code where you did that? And/or write quick summary of what is the core you are using?

ChrisBarker-NOAA commented 8 years ago

OK,

So I think we all agree that the goal here is to have a project that can and is widely used for a variety of use cases. But how do we get there?

@acrosby wrote:

prefer to have many lightweight sets of tools that build on one another rather than one super complicated do-it-all module that imposes particular ways of doing things under the hood

sure -- but having multiple features in a library neither makes it super complicated, nor "imposes" a particular way of doing things. If there is an interpolation method in pyugrid, and you want to use a different interpolation method, there is no reason at all that you can't write your own, same as if it isn't there. This is Python -- no private attributes you'd need to access that you can't.

I often am not using or relying on Anaconda

OK, so conda does not necessarily solve the dependencies problem -- I agree that we shouldn't say we can ignore dependencies as an issue if we have conda packages for them.

I would love for it to get to the point where I do, since I end up replicated just the core bits that I need by hand all the time.

So why haven't you used it? -- if it was missing a feature or two, why not contribute those, rather than replicating it by hand? This highlights to me that making a project easy to contribute to is key.

I don't see a reason they couldn't all piggy back off of a common namespace though.

What are you suggesting here? namespace packages? I've found those to pretty much totally suck -- I'd really rather not go there.

Anyway, what problem are we trying to solve here? If anyone can install py_ugrid, and then use the bits of it that they want, how does it hurt that there are bits that you aren't going to use? This is simply not that big a package (not scipy here!)

And if you split off into multiple packages, it's a lot harder for folks to find what they need, and harder to contribute changes, improvements. This stuff is all going to be tightly coupled, the odds are good (particularly this early in the project) that adding something like an interpolation routine will require additions to py_ugrid itself -- so kind of a pain to keep stuff in sync.

I suggest that we keep it in one package until/unless it really does grow to a "one super complicated do-it-all module".

So the problem we need to address is optional dependencies. I agree that if all you want to do is read a UGRID file and pull out a bit of data, you shouldn't need to have MPL, shapely, cell_tree, gdal, and who knows what the heck else installed. But I propose we handle that with optional dependencies, rather than splintering up into 5 or 6 or ?? packages

ocefpaf commented 8 years ago

This highlights to me that making a project easy to contribute to is key.

Indeed. We need a more agile review system and some basic contributors guidelines/documentation for that to happen. I am travelling now, but I will work on the docs once I get back home.

Regarding the package functionalities and optional dependencies... To be honest we cannot answer any of that yet. The package is young and we need to get it out there before figuring this out.

Note that the extra dependencies and installation problems are not the only issues in a big package. The code base can get tangled up and/or grow in complexity very quickly. That will drive away contributors and users.

acrosby commented 8 years ago

@ChrisBarker-NOAA all fair points, I agree. I was waxing idealist before.

I think I need more clarity about the goals, is it an interface to netcdf-cf-ugrid that provides a common and consistent api or are there aspirations for more generalized topological functionality? I do see some utility in a the base class that has some high level functionality like nearest n features, walk elements/edges/nodes/etc using neighbors, spatial subset, data conditional subset, voronoi, well constructed mesh appends, and triangulation/geometric stats--but these features may not be necessary.

I guess not all of my meshes/data originate in netcdf, so it is tempting to bypass the UGrid class, and just implement higher functionality separately. Interpolation aside, do we want/need higher level topological bits here or to serve as the base for others to build those things in different systems? UGrid class akin to a Pandas index or dataframe could be really useful, but does it belong here?--I don't have an answer, I'm just curious.

ChrisBarker-NOAA commented 8 years ago

Ye,s we do need to clarify goals, however, we did do a tiny bit of that (last SciPy?) in the form of what we wanted to API to support, and that included stuff like:

"give me a cross-section from this point to that point" etc.

And we are talking bout having py_ugrid and py_sgrid share a common API for that sort of thing. So it's always been clear (to me, anyway :-) ) that py_ugrid is about abstracting the grid itself, not just dealing with netcdf.

I guess not all of my meshes/data originate in netcdf, so it is tempting to bypass the UGrid class, and just implement higher functionality separately

Bingo! all this darn netcfd talk has given you the wrong impression -- py_ugrid is NOT primarily about netcdf. To the extent that it's about the netcdf UGRID standard, it's about implementing the data model from that standard, not the file format. If it's not useful to you if you're not using netcdf, then we have failed (or haven't gotten there yet...). You absolutely should not need netcdf to be anywhere in the picture for py_ugrid to be useful.

UGrid class akin to a Pandas index or dataframe could be really useful, but does it belong here?

I think that's EXACTLY what belongs here :-)

In our case, we work with meshes not in netcdf, and meshes in non-compliant netcdf, and we're moving toward py_ugrid being the place to bring all that together.

I do see some utility in a the base class that has some high level functionality like nearest n features, walk elements/edges/nodes/etc using neighbors, spatial subset, data conditional subset, voronoi, well constructed mesh appends, and triangulation/geometric stats--but these features may not be necessary.

It's certainly an option to build all that on top of py_ugrid, instead of in py_ugrid, but that requires standardization of the core data structure (py_ugrid), so why not put it in py_ugrid? Personally, I can't see putting all that in py_ugrid, but yes, a fair bit of it (not voronoi triangulation, that feels like "building a grid" rather than " working with a grid", but sure, maybe even that.

I guess they key to me is that if we each build those pieces we need on top of py_ugrid, we are going to get a heck of a lot less code sharing than we could.

Note that the extra dependencies and installation problems are not the only issues in a big package. The code base can get tangled up and/or grow in complexity very quickly. That will drive away contributors and users.

sure -- but let's wait 'till we get there. Breaking the whole package up will drive away contributors and users also -- as will too much focus on process/guidelines -- people use a package because it's useful, and because it's well documented (as you don't know how useful it is if it's not well documented), not because the coding style is consistent.

For now, we want to be sure we don't gratuitously break the API, but other than that, let's get some functionality in there!

acrosby commented 8 years ago

@ChrisBarker-NOAA OK sounds good. So I guess the next step is to ensure that the PR's do use some sort of try/except or conditional importing for the extra dependencies and functionality that depend on them. I suppose an optional requirements.txt would be a good addition in this case, and the PR's could provide changes to it for their optional dependencies.

I would also be inclined to add a text/summary output as part of the setup.py install that spits out what external modules were found, and what functionality is or is not available as a result--how does that sound?

For now, we want to be sure we don't gratuitously break the API, but other than that, let's get some functionality in there!

ChrisBarker-NOAA commented 8 years ago

I would also be inclined to add a text/summary output as part of the setup.py install that spits out what external modules were found, and what functionality is or is not available as a result--how does that sound?

So far, these are all run-time dependencies, so I'm not sure this is helpful.

I don't think pip has a way to di such a check/display such a message at install time.

Conda may -- I'm not sure.

My thoughts for basic workflow:

User pip installs or conda installs py_ugrid.

User does something like:

grid.save_as_shapefile()

User gets an ImportError with a message saying:

""" Shape file saving requires the Fiona Package. It can be installed with:

pip install fiona or conda install fiona """

I also think we should have a conda py_ugrid_sumo meta package that installs all of the optional dependencies.

I don't know if it's possible to make a dependency only wheel, but we shouldn't do that until all the dead are easily pip installable anyway.

-CHB

acrosby commented 8 years ago

Yeah, we would have to add our own routine to be called during the setup.py. They are runtime dependencies, so you are right, it doesn't really matter at install time.