ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0

fetchSCAN: use standard methods for managing curl handles, connection timeout etc. #284

Closed: brownag closed 1 year ago

brownag commented 1 year ago

To avoid stopping when hitting a curl timeout, e.g.

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [wcc.sc.egov.usda.gov] Connection timeout after 10006 ms
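For reference, a minimal sketch (illustrative only, not the soilDB internals) of catching this error client-side so a dropped request yields NULL instead of stopping:

```r
# Hedged sketch: trap a curl timeout and return NULL rather than stopping.
# safe_fetch() and the URL argument are illustrative names, not soilDB code.
safe_fetch <- function(url, timeout = 10) {
  h <- curl::new_handle(timeout = timeout)
  tryCatch(
    curl::curl_fetch_memory(url, handle = h),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}
```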
brownag commented 1 year ago

Also fix fetchSCAN(site.code=2142, year=1981:2022)

Error in FUN(X[[i]], ...) : 
  Item 2 has 8 columns, inconsistent with item 1 which has 7 columns. To fill missing columns use fill=TRUE.
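The error above is what data.table::rbindlist() raises when per-item results have differing column counts; a hedged illustration of the failure and the fill = TRUE remedy it suggests:

```r
library(data.table)

d1 <- data.table(a = 1, b = 2)
d2 <- data.table(a = 3, b = 4, c = 5)   # extra column, as with item 2 above

# rbindlist(list(d1, d2))               # errors: inconsistent column counts
rbindlist(list(d1, d2), fill = TRUE)    # missing columns are filled with NA
```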
dylanbeaudette commented 1 year ago

Thanks. I was planning to do some work with / on this function this week. Any idea if the timeout is due to high latency (our network / VPN) or to the SCAN servers? I'd suspect (and will ask) that the SCAN servers may have some throttling or access constraints leading to some of the timeouts.

Upping the timeout to 10 seconds seems to resolve the timeout issues on a relatively fast connection.

Depending on the possibility of throttling / access constraints, I'd like to try parallel queries.

Given the possibility of timeouts / lost data, what do you think about optionally computing an MD5 hash on the result? I suppose that this could be done external to this function as there is no reference other than a "clean", previous run. I'll think some more tomorrow and develop some use cases.

brownag commented 1 year ago

Any idea if the timeout is due to high latency (our network / VPN) or to the SCAN servers?

In my recent experience it has been randomly dropping requests / timing out. It does not appear to be directly related to the payload. While I suspect the network could be implicated in the general problem of timeouts, this does seem at least somewhat server-related.

Upping the timeout to 10 seconds seems to resolve the timeout issues on a relatively fast connection.

The initial issue in the OP was timing out with the default 10 seconds. I verified that most requests should be able to complete in far less than that (at least at daily granularity). Timing out on valid requests is indeed a problem, so we should use a much larger timeout, and return NULL with a message when it is exceeded. This gives the server adequate time to respond and/or issue a proper error, so we can distinguish the two client-side.
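To make that concrete, a hedged sketch of returning NULL only once a generous timeout is exceeded (`url` is a placeholder for a SCAN request URL):

```r
# Hedged sketch: with a large timeout, the server has time to respond or to
# issue a proper error; a true timeout is then distinguishable client-side.
res <- tryCatch(
  curl::curl_fetch_memory(url, handle = curl::new_handle(timeout = 300)),
  error = function(e) {
    message(conditionMessage(e))  # e.g. "Timeout was reached: ..."
    NULL
  }
)
```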

The other intention here (which I did not write into the issue narrative) was to standardize the use of a timeout for the soilDB curl handle used by fetchSCAN() along with (almost?) every other network-based request. I will add the NULL back in (https://github.com/ncss-tech/soilDB/commit/290ed722e1871206f43f4dc0d46799df017d6206), and the default timeout will match our default curl handle @ 300 seconds.
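A sketch of the idea (the helper name is hypothetical, not the actual soilDB internal):

```r
# Hedged sketch: build the shared curl handle in one place, so fetchSCAN()
# and other network-based functions all use the same 300 second default.
.soilDB_curl_handle <- function(timeout = 300) {
  curl::new_handle(timeout = timeout)
}
```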

Depending on the possibility of throttling / access constraints, I'd like to try parallel queries.

That seems like something that would create problems, not solve them (with respect to the server). I agree it could be useful, but is there really anything stopping a user from making parallel calls to fetchSCAN() right now?
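For example, nothing prevents this today (hedged sketch; the site codes are illustrative, and please be considerate of server load):

```r
library(parallel)

sites <- c(2142, 2143, 2144)  # illustrative SCAN site codes
res <- mclapply(sites, function(s) {
  try(soilDB::fetchSCAN(site.code = s, year = 2020:2022))
}, mc.cores = 2)
```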

Given the possibility of timeouts / lost data, what do you think about optionally computing an MD5 hash on the result? I suppose that this could be done external to this function as there is no reference other than a "clean", previous run. I'll think some more tomorrow and develop some use cases.

This was a concern of mine as well. As I said above, I think I should add the NULL result back in (on exceeding the new, substantially larger timeout). In simple cases it is possible to cross-check results vs. what was queried, but that could mean lots of output and extra logic, and it probably wouldn't work as well with the more generic query formats. One could at least check site IDs and sensors within years, but I think that simply returning nothing on a serious server-side problem is the best thing we can do for data quality/completeness without going much more complex.

I don't know how you would manage "authoritative" MD5 hashes to compare against. I'm not thinking too deeply about it... but I am not sure this is a viable option. If one were to implement this, at that point we should probably just snapshot all of the historic SCAN/SNOTEL data and update it annually, with fetchSCAN() intended only for one-off queries or for more recent data that hasn't been incorporated into the snapshot.
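For what it's worth, an external check along these lines is straightforward (hedged sketch assuming the digest package; `previous_hash` would come from an earlier "clean" run):

```r
# Hedged sketch: hash a fetchSCAN() result outside of the function itself
# and compare against a hash saved from a previous known-good run.
x <- soilDB::fetchSCAN(site.code = 2142, year = 2021)
h <- digest::digest(x, algo = "md5")
identical(h, previous_hash)  # TRUE only if the result matches the clean run
```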

dylanbeaudette commented 1 year ago

OK, good ideas. I like the idea of a common set of curl options. I think you are more familiar with that; would you mind implementing it?

Good point on keeping the function as simple as possible: parallelization / MD5 hashes can all be done outside of the function.

I will add some notes to the documentation about the possibility of server-side issues. I get this with CDEC sometimes too.

brownag commented 1 year ago

The items I had in mind for this issue are now addressed w/ https://github.com/ncss-tech/soilDB/commit/290ed722e1871206f43f4dc0d46799df017d6206