ioos / service-monitor

A web based catalog of IOOS services and datasets

http://catalog.ioos.us

Investigate AOOS/CeNCOOS SOS harvesting failures #359

Closed benjwadams closed 9 years ago

benjwadams commented 9 years ago

As mentioned in #318, AOOS and CeNCOOS 52North instances appear to be timing out on harvest attempts.

benjwadams commented 9 years ago

[Screenshot: aoos_screnshot]

Bingo. Increasing the redis queue timeout appears to do the trick for AOOS. Haven't tested against CeNCOOS but I suspect it's a similar story.

benjwadams commented 9 years ago

CeNCOOS is timing out for a slightly different reason: the network:all DescribeSensor request is extremely slow. Still chugging away after 8+ minutes.

http://sos.cencoos.org/sos/sos/kvp?outputFormat=text%2Fxml%3Bsubtype%3D%22sensorML%2F1.0.1%2Fprofiles%2Fioos_sos%2F1.0%22&version=1.0.0&request=DescribeSensor&procedure=urn%3Aioos%3Anetwork%3Acencoos%3Aall&service=SOS

On the other hand, AOOS's network:all does load.

http://sos.aoos.org/sos/sos/kvp?outputFormat=text%2Fxml%3Bsubtype%3D%22sensorML%2F1.0.1%2Fprofiles%2Fioos_sos%2F1.0%22&version=1.0.0&request=DescribeSensor&procedure=urn%3Aioos%3Anetwork%3Aaoos%3Aall&service=SOS
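
The two requests above differ only in the endpoint host and the network name inside the `procedure` URN. As a rough sketch, they can be rebuilt with the standard library alone. The helper name `describe_sensor_url` is made up for illustration and is not part of service-monitor or OWSLib.

```python
from urllib.parse import urlencode

# Output format string taken from the URLs quoted in this thread.
IOOS_SENSORML_FORMAT = 'text/xml;subtype="sensorML/1.0.1/profiles/ioos_sos/1.0"'


def describe_sensor_url(endpoint, procedure):
    """Build a KVP DescribeSensor URL (hypothetical helper, for illustration)."""
    params = {
        "outputFormat": IOOS_SENSORML_FORMAT,
        "version": "1.0.0",
        "request": "DescribeSensor",
        "procedure": procedure,
        "service": "SOS",
    }
    # urlencode percent-encodes "/", ";", "=", '"' and ":" in the values.
    return endpoint + "?" + urlencode(params)


cencoos_all = describe_sensor_url(
    "http://sos.cencoos.org/sos/sos/kvp", "urn:ioos:network:cencoos:all"
)
aoos_all = describe_sensor_url(
    "http://sos.aoos.org/sos/sos/kvp", "urn:ioos:network:aoos:all"
)
```

Swapping the `procedure` argument for a station URN would give the per-station request in the same way.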

~~Haven't been able to get the response size for CeNCOOS. It's quite large, whatever it is.~~ Edit: ~9.5 MB, and it takes a long time to load as well. Fundamentally, there are two timeouts: a timeout for OWSLib to grab the data and a timeout for the queue to kill processing of a particular job after a certain amount of time has elapsed.
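
To make the two layers concrete, here is a minimal sketch. It is not the actual service-monitor code: `harvest_stations`, `JobTimeout`, and the `fetch` callable are hypothetical names, and the numbers in the usage are assumptions. The inner limit is handed to each fetch, while the outer limit covers the whole job. A real queue kills the job from outside; this sketch only checks the outer limit between stations.

```python
import time


class JobTimeout(Exception):
    """Raised when the whole job exceeds its time budget (hypothetical)."""


def harvest_stations(stations, fetch, request_timeout, job_timeout,
                     clock=time.monotonic):
    """Fetch each station with its own timeout, under one overall job limit.

    stations        -- list of station identifiers
    fetch           -- callable(station, timeout=...) doing one request
    request_timeout -- seconds allowed for a single fetch (inner timeout)
    job_timeout     -- seconds allowed for the whole job (outer timeout)
    """
    started = clock()
    harvested = []
    for station in stations:
        # Outer timeout: stop the job once the total budget is used up.
        if clock() - started >= job_timeout:
            raise JobTimeout(
                f"gave up after {len(harvested)} of {len(stations)} stations"
            )
        # Inner timeout: bounds only this one request.
        fetch(station, timeout=request_timeout)
        harvested.append(station)
    return harvested
```

With this split, a single slow response trips the inner timeout, while a service with thousands of individually quick stations can still exhaust the outer one.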

benjwadams commented 9 years ago

Ok, #361 should help the situation for AOOS. Unfortunately, if we're adding a new service which contains a lot of datasets which don't have a previous harvest, this fix won't apply. I think we ought to queue datasets rather than services and set the timeout there, but that would require a bit of retooling. Also I'm still not sure what to do with the CeNCOOS "network:all" DescribeSensor response. I've tried introducing some code to handle network datasets by introducing a large timeout, but the response is so large and intensive to process that I actually got a server timeout several times when trying to select from it.
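
For illustration only, queuing per dataset instead of per service might look roughly like the sketch below. `HarvestJob` and `plan_dataset_jobs` are hypothetical names, the 600-second default is an assumption, and the actual enqueue call is left out because it depends on the queue library in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvestJob:
    """One unit of work for the queue: a single dataset, not a whole service."""
    service_url: str
    dataset_id: str
    timeout: int  # seconds this one job may run before the queue gives up


def plan_dataset_jobs(service_url, dataset_ids, per_dataset_timeout=600):
    """Split one service harvest into one job per dataset (hypothetical)."""
    return [
        HarvestJob(service_url, dataset_id, per_dataset_timeout)
        for dataset_id in dataset_ids
    ]


jobs = plan_dataset_jobs(
    "http://sos.cencoos.org/sos/sos/kvp",
    [
        "urn:ioos:station:gov.usda.nrcs.wcc.snotel:319",
        "urn:ioos:station:gov.usda.nrcs.wcc.snotel:320",
    ],
)
# Each job would then be enqueued separately with its own timeout.
```

The point of the design is that the timeout no longer has to scale with the number of datasets behind a service, so a newly added large service would not need a special case.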

lukecampbell commented 9 years ago

So we've improved the rate at which we can successfully harvest AOOS: it was 0 and now it's about half.

lukecampbell commented 9 years ago

I've made a breakthrough by removing the 'all' offering from the harvesting process. CeNCOOS is massive; it's been harvesting on my dev machine for close to 30 minutes now.

lukecampbell commented 9 years ago

Still going....

```
--------------------------------------------------------------------------------
INFO in harvest [/Users/lcampbell/Documents/Dev/code/catalog/ioos_catalog/tasks/harvest.py:349]:
process_station: urn:ioos:station:gov.usda.nrcs.wcc.snotel:319
--------------------------------------------------------------------------------
Timeout 600
--------------------------------------------------------------------------------
INFO in harvest [/Users/lcampbell/Documents/Dev/code/catalog/ioos_catalog/tasks/harvest.py:349]:
process_station: urn:ioos:station:gov.usda.nrcs.wcc.snotel:320
--------------------------------------------------------------------------------
Timeout 600
--------------------------------------------------------------------------------
INFO in harvest [/Users/lcampbell/Documents/Dev/code/catalog/ioos_catalog/tasks/harvest.py:349]:
process_station: urn:ioos:station:gov.usda.nrcs.wcc.snotel:321
--------------------------------------------------------------------------------
```

lukecampbell commented 9 years ago

Mark the time, it's done.

lukecampbell commented 9 years ago

2211 Fixed
7 Unspecified
2 MOORED BUOY
55 Buoy
2094 FIXED MET STATION

Successfully harvested from cencoos.

abirger commented 9 years ago

Hooray!!!

lukecampbell commented 9 years ago

[Screenshot: 2015-03-13 at 9:06:29 AM]

In summary, the solution was to increase the timeout to about an hour or two for known large data services, and to skip the "all" offering, because that single request takes about as long as harvesting the rest of the dataset.
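
A minimal sketch of those two rules, under stated assumptions: the function names, the host list, the 600-second default, and the way an "all" offering is recognised (last URN segment equal to `all`) are all illustrative guesses, not the code that was merged.

```python
from urllib.parse import urlparse

DEFAULT_JOB_TIMEOUT = 600              # assumed default, in seconds
LARGE_SERVICE_TIMEOUT = 2 * 60 * 60    # "about an hour or two"

# Assumed list of services known to be large; would need to be maintained.
LARGE_SERVICE_HOSTS = {"sos.cencoos.org", "sos.aoos.org"}


def job_timeout_for(service_url):
    """Pick a longer job timeout for known large services (hypothetical)."""
    host = urlparse(service_url).hostname
    if host in LARGE_SERVICE_HOSTS:
        return LARGE_SERVICE_TIMEOUT
    return DEFAULT_JOB_TIMEOUT


def is_all_offering(name):
    """Guess whether an offering is the network-wide "all" offering."""
    return name.split(":")[-1].lower() == "all"


def offerings_to_harvest(offering_names):
    """Drop the "all" offering and keep the individual ones."""
    return [name for name in offering_names if not is_all_offering(name)]
```

For example, `offerings_to_harvest(["urn:ioos:network:cencoos:all", "urn:ioos:station:gov.usda.nrcs.wcc.snotel:319"])` keeps only the station entry, and `job_timeout_for("http://sos.cencoos.org/sos/sos/kvp")` returns the two-hour value.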

I still see an issue with AOOS where one of two things is happening:

- The SOS just does not respond (GetCapabilities)
- The hostname doesn't have an IP address (DNS error? Really confused as to how this could be)

benjwadams commented 9 years ago

:+1:

carmelortiz commented 9 years ago

Validated fix. AOOS has harvested 19/20 times, CeNCOOS 16/20.

Harvesting failures were all either: a) URLError: <urlopen error [Errno -5] No address associated with hostname> or b) Service Ping Timeout: HTTPConnectionPool(host='sos.cencoos.org', port=80): Read timed out. (read timeout=60)