ulmo-dev / ulmo

clean, simple and fast access to public hydrology and climatology data.
http://ulmo.readthedocs.org

Add (optional?) support for cache=None to suds in cuahsi.wof.core.py requests #94

Closed emiliom closed 10 years ago

emiliom commented 10 years ago

Howdy @wilsaj. I've been using ulmo's CUAHSI WOF access quite a bit. Independently, I recently ran into a problem with suds: my batch hourly, multi-station data downloads (FYI, from CUAHSI WOF from HIS Central, but not with ulmo) were creating a lot of temp files in /tmp on Linux (Ubuntu). So many that I learned the hard way when they filled up a disk!

The default behavior of suds can lead to this, as it caches WSDLs and possibly other stuff (incidentally, this behavior of writing to /tmp directories is also considered a bit of a security issue).
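
To check how much space such a cache is taking, a small helper along these lines may be useful. This is only a sketch: `dir_usage` is a hypothetical helper, and the cache location (a `suds` folder under the system temp directory) is an assumption about the suds default that should be verified on the machine in question.

```python
import os
import tempfile


def dir_usage(path):
    """Return (file_count, total_bytes) for everything under path."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return count, total


# Assumption: the cache lives in a "suds" folder under the temp directory.
cache_dir = os.path.join(tempfile.gettempdir(), "suds")
if os.path.isdir(cache_dir):
    n_files, n_bytes = dir_usage(cache_dir)
    print(f"{cache_dir}: {n_files} files, {n_bytes / 1e6:.1f} MB")
```

Running this from cron alongside the batch downloads would show whether the directory keeps growing.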

Calling suds.client.Client with cache=None prevents this writing of cache files; it may have performance side effects, but I haven't looked into it. I'd like to be able to run ulmo cuahsi.wof requests (like get_values) with suds cache=None. I can make this change to https://github.com/ulmo-dev/ulmo/blob/master/ulmo/cuahsi/wof/core.py on my fork and submit a pull request. But I wanted to hear your recommendation first. Should cache=None be the hard-wired behavior? Or should every WOF request in cuahsi/wof/core.py accept an optional suds cache argument (with a default value, to be decided)?

Incidentally, I can take care of #64 while handling this. That looks very easy.

wilsaj commented 10 years ago

Thanks for looking into this and offering to work on it, Emilio.

My feeling on caching is that it should at least be optional, but I'll defer to you on what the default should be since you are the expert on using ulmo for CUAHSI web services.

The tradeoff is that not caching will add an overhead of a full WSDL request (~87k) and whatever time it takes to build the python Client object every time a wof.function() is called, since a new client is instantiated at the beginning of each call. That will have performance implications, and is kind of inconsiderate to WOF servers. The best thing to do would probably be to instantiate a single suds.Client and hold a reference to that from within Ulmo instead of creating a new one with each call.

The security concern is a valid one. I thought suds was only caching WSDL responses. It turns out that it is also pickling/unpickling suds.Client objects. A maliciously crafted pickle file can execute arbitrary code, so that's pretty shoddy. Filling up a disk is also, obviously, not cool.
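
To illustrate why unpickling untrusted files is risky, here is a minimal, harmless sketch using only the standard library. The `Payload` class is hypothetical and has nothing to do with suds itself; it only shows that `pickle.loads` calls whatever callable the pickle names.

```python
import pickle


class Payload:
    # pickle asks __reduce__ how to rebuild the object. It returns
    # (callable, args), and the *loader* later calls callable(*args).
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for attacker code


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval("6 * 7") runs while loading
print(result)
```

The loaded value is whatever the callable returned, not a `Payload` instance, so anyone who can write into a world-writable cache directory can choose what code the next reader of that cache runs.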

I'll be happy to accept a pull request, and taking care of #64 would be very welcome as well.

emiliom commented 10 years ago

Thanks, Andy. I hadn't thought about the implication for WOF servers of not caching. You're right. More reasons to not be a SOAP fan ;)

BTW, as far as I can tell suds has been sort-of abandoned for the last 3-4 years. The last release on PyPI is 4 years old. A forked version over at Bitbucket seems much more active, but I don't have time to look into it.

As for this:

The best thing to do would probably be to instantiate a single suds.Client and hold a reference to that from within Ulmo instead of creating a new one with each call.

Are you suggesting a global-scope variable created by ulmo, or an instance that can be passed around as an argument to all WOF requests? I'm all ears! Are there examples/analogs elsewhere in ulmo?

jirikadlec2 commented 10 years ago

Andy and Emilio, you may be interested to know that some of the WOF servers on HIS Central also have a REST-like endpoint, so you don't have to use SOAP with these servers. For example, the SNOTEL service:
http://drought.usu.edu/snotel/cuahsi_1_1.asmx/GetSitesObject?site=&authToken=
http://drought.usu.edu/snotel/cuahsi_1_1.asmx/GetSiteInfoObject?site=SNOTEL:51K05S&authToken=
http://drought.usu.edu/snotel/cuahsi_1_1.asmx/GetValuesObject?location=SNOTEL:51K05S&variable=SNOTEL:PILL&startDate=2012-01-01&endDate=2012-08-31&authToken=
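
As a rough sketch of how such a GetValuesObject URL could be assembled in Python: the base URL and query parameter names are copied from the example above, while `values_url` is a hypothetical helper, not part of ulmo. Nothing is fetched here, and whether the server still responds is not assumed.

```python
from urllib.parse import urlencode

# Base URL taken from the SNOTEL example in this thread.
BASE = "http://drought.usu.edu/snotel/cuahsi_1_1.asmx"


def values_url(location, variable, start_date, end_date, auth_token=""):
    """Build a GetValuesObject URL (hypothetical helper)."""
    query = {
        "location": location,
        "variable": variable,
        "startDate": start_date,
        "endDate": end_date,
        "authToken": auth_token,
    }
    # safe=":" keeps codes such as SNOTEL:51K05S readable in the URL.
    return f"{BASE}/GetValuesObject?" + urlencode(query, safe=":")


url = values_url("SNOTEL:51K05S", "SNOTEL:PILL", "2012-01-01", "2012-08-31")
print(url)
```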

wilsaj commented 10 years ago

BTW, as far as I can tell suds has been sort-of abandoned for the last 3-4 years. The last release on PyPI is 4 years old. A forked version over at Bitbucket seems much more active, but I don't have time to look into it.

Yeah. Fortunately, SOAP is just a legacy protocol these days. A cursory look suggests that other python SOAP clients exist, but I don't know what state they're in - nothing looks very active.

Are you suggesting a global-scope variable created by ulmo, or an instance that can be passed around as an argument to all WOF requests? I'm all ears! Are there examples/analogs elsewhere in ulmo?

Not passing around a client object would be better, for the sake of API usability and not introducing a breaking change. A sketch of the idea is this:

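import suds.client

# Module-level cache for the most recently built client.
_suds_client = None

def _get_client(wsdl_url):
    """Return a shared client, rebuilding it only when the WSDL URL changes."""
    global _suds_client
    if _suds_client is None or _suds_client.wsdl.url != wsdl_url:
        _suds_client = suds.client.Client(wsdl_url, cache=None)
    return _suds_client

def any_of_the_wof_functions(wsdl_url, blabla):
    suds_client = _get_client(wsdl_url)
    ....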
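
One possible variation on the sketch above, for scripts that alternate between several WOF services: keep one client per WSDL URL in a dict rather than a single global. This is only an illustration. `get_client`, `_clients`, and `FakeClient` are hypothetical names, and the client factory is passed in so the example runs without suds or network access; with real suds the factory might be `lambda url: suds.client.Client(url, cache=None)`.

```python
# One cached client per WSDL URL (hypothetical module-level registry).
_clients = {}


def get_client(wsdl_url, factory):
    """Return the cached client for wsdl_url, building it on first use."""
    if wsdl_url not in _clients:
        _clients[wsdl_url] = factory(wsdl_url)
    return _clients[wsdl_url]


class FakeClient:
    """Stand-in for a SOAP client that counts how many were built."""
    built = 0

    def __init__(self, url):
        FakeClient.built += 1
        self.url = url


a = get_client("http://example.com/a?WSDL", FakeClient)
b = get_client("http://example.com/a?WSDL", FakeClient)  # reused, not rebuilt
c = get_client("http://example.com/b?WSDL", FakeClient)  # new URL, new client
print(FakeClient.built)
```

The trade-off is memory: every distinct WSDL URL keeps a client alive for the life of the process.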

@jirikadlec2 that's good to know. Straight HTTP endpoints make a lot more sense for WOF. Do we still need to support WOF over SOAP indefinitely or are there plans to upgrade WOF services to the REST-like endpoints and deprecate SOAP altogether?

emiliom commented 10 years ago

Thanks, @wilsaj. Your sketch looks great. Totally agree that a goal must be to minimize changes for users. That probably means, in addition to your suggestion, adding an optional "sudscache" parameter to each of the WOF calls, with a default that reproduces current behavior.
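
One wrinkle with such a parameter: None already means "no cache" to suds, so it cannot also mean "use the current default". A sentinel default is one way around that. The sketch below is an assumption about how this could be done, not the final ulmo API; `client_kwargs` and `_USE_SUDS_DEFAULT` are hypothetical names.

```python
# Unique marker meaning "caller did not pass sudscache at all".
_USE_SUDS_DEFAULT = object()


def client_kwargs(sudscache=_USE_SUDS_DEFAULT):
    """Translate the optional argument into keyword args for the client."""
    if sudscache is _USE_SUDS_DEFAULT:
        # Pass nothing, so suds keeps whatever caching it does by default.
        return {}
    # Pass the value through as given; None switches caching off.
    return {"cache": sudscache}


print(client_kwargs())
print(client_kwargs(None))
```

Each WOF function could then build its client with something like `suds.client.Client(wsdl_url, **client_kwargs(sudscache))`.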

I'll try to take a crack at this in the next couple of days. I'm highly motivated, b/c I have some operational things that I've ported to ulmo from plain suds WOF requests, but I absolutely need it to use cache=None ASAP, before I can fully deploy it.

@jirikadlec2 , thanks for your comments. Good reminder. But I believe that the only service type that WOF 1.x defined was SOAP, so we likely can't assume that any random WOF server supports anything other than SOAP. As for the future, that's a different ballgame. There won't be a "WOF 2.0" as far as I know. With WaterML 2 being an OGC standard based on OGC components (GML, O&M, and WaterML 2 itself), the future service access will be via SOS, which has been based on HTTP POST/GET (one could call an SOS GET request REST-like, but it might be stretching the meaning). OGC is currently circulating an SOS 2 recommendation for WaterML 2 access; if it includes SOAP, I'm sure it's not front and center.