Complete Refactor of Ulmo

dharhas commented 9 years ago

I'm planning a complete refactor of ulmo with the following (major) target features:

- Move to a plugin system for the services (using stevedore)
  - This will allow enforcement of a more consistent API (e.g. get_sites, get_stations, etc. would be harmonized) while still retaining flexibility for individual plugins.
  - Allow including closed plugins for non-public datasets
- Consistently use Pandas DataFrames internally, with options to serialize to Python dicts and GeoJSON
- Move duplicate/common functionality into a common place
- Create a common caching system, i.e. expand the HDF5 cache ability that ulmo.usgs.nwis.hdf has to all services
- Python 3 support (while retaining Python 2.7 compatibility) using the 'six' module
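
To make the serialization target a bit more concrete, here is a minimal sketch of turning site records into a GeoJSON FeatureCollection. It uses plain dicts instead of a DataFrame to stay dependency-free. The function name `sites_to_geojson` and the field names (`code`, `name`, `longitude`, `latitude`) are assumptions made up for illustration, not part of ulmo.

```python
import json


def sites_to_geojson(sites):
    """Convert a list of site dicts into a GeoJSON FeatureCollection dict.

    Assumes each site has 'longitude' and 'latitude' keys (hypothetical
    field names); every other key is copied into the feature properties.
    """
    features = []
    for site in sites:
        properties = {
            key: value
            for key, value in site.items()
            if key not in ("longitude", "latitude")
        }
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [site["longitude"], site["latitude"]],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up example records, for illustration only.
example_sites = [
    {"code": "A1", "name": "Example Creek", "longitude": -97.7, "latitude": 30.3},
    {"code": "B2", "name": "Sample River", "longitude": -95.4, "latitude": 29.8},
]
collection = sites_to_geojson(example_sites)
geojson_text = json.dumps(collection)
```

With a DataFrame, something like `df.to_dict('records')` could supply the list of dicts that this function expects.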

I will probably tag the release 1.0 to indicate that it will break backwards compatibility.

If folks are interested in contributing to the refactor or wish to discuss these changes in more detail, please comment on this issue.

emiliom commented 9 years ago

I might be able to contribute, depending on the timing, etc. If nothing else, by posting this comment I'm adding myself to the notifications on discussions on this issue.

Quick question/comment: Are you considering using GeoPandas when the DataFrame has a spatial component? I've used GeoDataFrames a bit, but not enough to have a solid opinion regarding its maturity.

dharhas commented 9 years ago

I'm actually researching spatial indexes right now. I'm leaning away from GeoPandas since it depends on shapely and fiona, and hence on GEOS and GDAL, which are a pain to install cross-platform. I'm considering having a geometry column that is just an array of coords (Point/Line/Poly) to enable some simple bbox filtering.
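
A rough sketch of what bbox filtering over plain coordinate arrays could look like, without shapely or GDAL. The record layout (a `geometry` key holding a list of `(x, y)` pairs) and the function names are assumptions for illustration. A point is a one-element list, and lines and polygons are longer lists.

```python
def bounds(coords):
    """Return (min_x, min_y, max_x, max_y) for a sequence of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def intersects_bbox(coords, bbox):
    """Return True if the bounding box of coords overlaps bbox.

    Only bounding boxes are compared, so a line or polygon whose extent
    overlaps bbox is reported as a match even if the shape itself misses it.
    """
    min_x, min_y, max_x, max_y = bounds(coords)
    bbox_min_x, bbox_min_y, bbox_max_x, bbox_max_y = bbox
    return (min_x <= bbox_max_x and max_x >= bbox_min_x
            and min_y <= bbox_max_y and max_y >= bbox_min_y)


def filter_by_bbox(features, bbox):
    """Keep the features whose geometry extent overlaps bbox."""
    return [f for f in features if intersects_bbox(f["geometry"], bbox)]


# Made-up features: two points and one two-vertex line.
features = [
    {"id": "point-in", "geometry": [(-97.7, 30.3)]},
    {"id": "point-out", "geometry": [(-80.0, 40.0)]},
    {"id": "line-crossing", "geometry": [(-101.0, 31.0), (-99.0, 31.5)]},
]
search_bbox = (-100.0, 28.0, -95.0, 33.0)  # (min_x, min_y, max_x, max_y)
selected = [f["id"] for f in filter_by_bbox(features, search_bbox)]
```

The extent-only test is coarse, but it is cheap and may be enough for a first-pass "which stations are in this box" query.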

emiliom commented 9 years ago

Makes sense.

dharhas commented 9 years ago

@emiliom @wilsaj @nathanhilbert @cameronbracken

So do folks have any preference between these two API approaches? I'm leaning towards b) but wanted to get some input.

a) Flat API. You pass service name and dataset name and any other parameters to each call:

    stations = ulmo.get_features('usgs-nwis', 'iv', state='TX')
    data = ulmo.get_data('usgs-nwis', 'iv', features='00824562', start='2014-01-01', parameters='00600')

This makes the API simpler and clearer to use but potentially less flexible, e.g. not all services have something analogous to get_features (see usgs.eddn; we would have to raise a 'Not Implemented' error for that). It is also a bit more verbose to type. The API would have to cover all the main use cases.

b) Plugin API. You load a plugin by specifying the service and dataset, and then use that:

    nwis = ulmo.load_service('usgs-nwis', 'iv')
    stations = nwis.get_features(state='TX')
    data = nwis.get_data(features='00824562', start='2014-01-01', parameters='00600')

This API is more flexible since each plugin could define its own API. We would have some base classes for similar plugins (e.g. timeseries, raster, etc.) to keep the API reasonably consistent across plugins.
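
As a sketch of how base classes might keep plugins consistent under approach b): the class names `TimeseriesService`, `ExampleService` and `DataOnlyService` are hypothetical, and the in-memory data is made up. The base class gives `get_features` a default that raises NotImplementedError, which covers services with nothing analogous to it, while `get_data` must be implemented by every plugin.

```python
from abc import ABC, abstractmethod


class TimeseriesService(ABC):
    """Hypothetical base class shared by timeseries plugins."""

    def __init__(self, dataset):
        self.dataset = dataset

    def get_features(self, **filters):
        # Default for services that have no notion of features/stations.
        raise NotImplementedError(
            "%s does not provide features" % type(self).__name__
        )

    @abstractmethod
    def get_data(self, features, start=None, end=None, parameters=None):
        """Return data for the requested features."""


class ExampleService(TimeseriesService):
    """Toy plugin backed by in-memory data instead of a web service."""

    _SITE_STATES = {"S1": "TX", "S2": "OK"}

    def get_features(self, **filters):
        state = filters.get("state")
        return sorted(
            code for code, site_state in self._SITE_STATES.items()
            if state is None or site_state == state
        )

    def get_data(self, features, start=None, end=None, parameters=None):
        # A real plugin would fetch and parse remote data here.
        return {"dataset": self.dataset, "features": features, "start": start}


class DataOnlyService(TimeseriesService):
    """Toy plugin for a service that has data but no feature listing."""

    def get_data(self, features, start=None, end=None, parameters=None):
        return {}


service = ExampleService("iv")
texas_sites = service.get_features(state="TX")
result = service.get_data(features="S1", start="2014-01-01")
```

A plugin can still add extra methods of its own. The base class only fixes the names and signatures of the calls that similar plugins share.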

jirikadlec2 commented 9 years ago

How would you use b) with the CUAHSI WaterOneFlow / WaterML web services?

Would it be something like:

    cuahsi = ulmo.load_service('cuahsi-his', 'http://hydroportal.cuahsi.org/GLEON_Sunapee/cuahsi_1_1.asmx')
    stations = cuahsi.get_features()
    data = cuahsi.get_data(features='GLEON_Sunapee:SUNAPEE', variable='GLEON_Sunapee:watertemp', method=9)

Or would you consider the 'CUAHSI HIS Central' as a service and each of the HydroServers as a dataset?

jirikadlec2 commented 9 years ago

About the Python 3 support: I suggest removing the dependency on suds. To my knowledge, the suds package only exists for Python 2, and ulmo doesn't really use it except for the CUAHSI WaterOneFlow. One actively maintained replacement package to consider is PySimpleSOAP: https://pypi.python.org/pypi/PySimpleSOAP

dharhas commented 9 years ago

The api I am considering is:

- ulmo.list_services -> Gets a list of services available to ulmo (e.g. nwis, cdec, etc.)
- ulmo.list_datasets -> Gets a list of datasets available for a given service (there might only be one)
- ulmo.load_service -> Loads a specific service/dataset combo

So I think 'CUAHSI HIS Central' would be the service and each HydroServer would be a dataset.
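
A minimal sketch of how those three calls could fit together, using an in-memory registry. The `register_service` helper, the `StubPlugin` class and the registered names are assumptions for illustration. A real implementation would presumably discover plugins through stevedore instead of explicit registration.

```python
import functools

# Maps service name -> {dataset name -> zero-argument plugin factory}.
_REGISTRY = {}


def register_service(service, dataset, factory):
    """Record a factory for a service/dataset combo (hypothetical helper)."""
    _REGISTRY.setdefault(service, {})[dataset] = factory


def list_services():
    """Return the names of all known services."""
    return sorted(_REGISTRY)


def list_datasets(service):
    """Return the dataset names available for one service."""
    return sorted(_REGISTRY[service])


def load_service(service, dataset):
    """Instantiate the plugin for a specific service/dataset combo."""
    try:
        factory = _REGISTRY[service][dataset]
    except KeyError:
        raise ValueError("unknown service/dataset: %s/%s" % (service, dataset))
    return factory()


class StubPlugin(object):
    """Stand-in for a real plugin; it only remembers what it was loaded as."""

    def __init__(self, service, dataset):
        self.label = "%s/%s" % (service, dataset)


for service_name, dataset_name in [
    ("usgs-nwis", "iv"),
    ("usgs-nwis", "dv"),
    ("cuahsi-his", "GLEON_Sunapee"),
]:
    register_service(
        service_name,
        dataset_name,
        functools.partial(StubPlugin, service_name, dataset_name),
    )

nwis = load_service("usgs-nwis", "iv")
```

Here each HydroServer would be registered as one dataset under the 'cuahsi-his' service, matching the service/dataset split described above.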

The other change I'm considering is moving each service into its own repo as an extension, potentially with its own maintainer. The pattern would be similar to the one the 'flask' package uses, i.e. we would have separate Python packages: ulmo (with common functions and base classes) and ulmo_cuahsi, ulmo_nwis, etc.

Advantages:

- ulmo.list_services() would list which extensions are installed.
- Individual service extensions could be maintained by folks who use them heavily and are available to contribute.
- Each extension can have its own dependencies, e.g. cuahsi-his requires suds but other services don't.
- We could also pin extensions to explicit versions of ulmo. For example, if cuahsi-his is not py3 ready we could pin it to the last py2 version of ulmo.
- You could easily write your own closed/internal plugins.

Disadvantage:

The main disadvantage of this approach is that you would now have to install several modules to get full functionality, but I guess we could make a meta package that pulls in all supported plugins.
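
One way the "list which extensions are installed" idea could work is through packaging entry points, the mechanism stevedore builds on. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+, so it does not cover the Python 2.7 case). The entry point group name `ulmo.services` and the function name are assumptions for illustration.

```python
from importlib import metadata

# Hypothetical entry point group that each extension package would
# advertise its plugins under in its packaging metadata.
SERVICE_GROUP = "ulmo.services"


def list_installed_extensions(group=SERVICE_GROUP):
    """Return the sorted names of entry points registered under group."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Newer Python versions return an object with a select() method.
        selected = entry_points.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        selected = entry_points.get(group, [])
    return sorted(entry_point.name for entry_point in selected)


installed = list_installed_extensions()
```

With no extension packages installed this returns an empty list. Installing a package that declares an entry point in the group would make its name show up without any change to the core package.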

I don't have the bandwidth to support all the data sources, so distributing the load would help out enormously. @jirikadlec2 would you be available to convert from suds to PySimpleSOAP? I currently don't use HIS services much.

dharhas commented 9 years ago

I'm going to experiment with the approach of having a package called 'ulmo-common' and converting the services to extensions named 'ulmo-extensionname'.

ulmo-dev / ulmo

Complete Refactor of Ulmo #109