mwengren closed this issue 2 months ago
@benjwadams is looking into this in relation to the above monitoring scripts and troubleshooting why jobs appear to be restarting more frequently than they should.
During today's meeting, we looked at the GCOOS Biological ERDDAP WAF harvest source, which appeared to be re-running roughly hourly whenever the harvest job errored out quickly with a read timeout from GCOOS' server (HTTPS, port 443):
Unable to get content for URL: https://gcoos5.geos.tamu.edu/erddap/metadata/iso19115/xml/: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='gcoos5.geos.tamu.edu', port=443): Read timed out. (read timeout=60)"))
[Job: 9821d492-1605-4897-b3d0-de004d27d7f5](https://data.ioos.us/harvest/gcoos-waf-historical/job/9821d492-1605-4897-b3d0-de004d27d7f5)
Started: October 10, 2023, 1:01 AM (UTC-04:00) — Finished: October 10, 2023, 1:02 AM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
[Job: 47fb33cf-a35c-420f-a1f9-9ff93ebbacae](https://data.ioos.us/harvest/gcoos-waf-historical/job/47fb33cf-a35c-420f-a1f9-9ff93ebbacae)
Started: October 10, 2023, 12:01 AM (UTC-04:00) — Finished: October 10, 2023, 12:02 AM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
[Job: 505f9a44-7a72-4917-89f2-2f4464371db4](https://data.ioos.us/harvest/gcoos-waf-historical/job/505f9a44-7a72-4917-89f2-2f4464371db4)
Started: October 9, 2023, 11:01 PM (UTC-04:00) — Finished: October 9, 2023, 11:02 PM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
[Job: 9dd14b41-6df1-476a-9f5e-2c252ab1dfee](https://data.ioos.us/harvest/gcoos-waf-historical/job/9dd14b41-6df1-476a-9f5e-2c252ab1dfee)
Started: October 9, 2023, 10:01 PM (UTC-04:00) — Finished: October 9, 2023, 10:02 PM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
Then a successful harvest would run, one that was able to get a valid response from GCOOS' server and took about 13 hours to complete (with a number of errors that are to be expected and still constitute a 'successful' harvest):
[Job: 94ecc350-c6d0-40da-8dbb-d1fbb34428fd](https://data.ioos.us/harvest/gcoos-waf-historical/job/94ecc350-c6d0-40da-8dbb-d1fbb34428fd)
Started: October 10, 2023, 1:45 AM (UTC-04:00) — Finished: October 10, 2023, 2:44 PM (UTC-04:00)
2657 errors 0 added 2597 updated 0 deleted 0 not modified
Then, it would go back to the roughly hourly job execution pattern:
[Job: f7b50b2f-9f72-4b8e-8a09-d722504c70f4](https://data.ioos.us/harvest/gcoos-waf-historical/job/f7b50b2f-9f72-4b8e-8a09-d722504c70f4)
Started: October 10, 2023, 5:01 PM (UTC-04:00) — Finished: October 11, 2023, 5:12 AM (UTC-04:00)
1260 errors 0 added 1226 updated 0 deleted 0 not modified
[Job: 1562a12a-9d51-49a9-b605-7d9866dc7cd6](https://data.ioos.us/harvest/gcoos-waf-historical/job/1562a12a-9d51-49a9-b605-7d9866dc7cd6)
Started: October 10, 2023, 4:01 PM (UTC-04:00) — Finished: October 10, 2023, 4:02 PM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
[Job: 4c553d1f-b6a4-4af6-b12f-3110b93dcab9](https://data.ioos.us/harvest/gcoos-waf-historical/job/4c553d1f-b6a4-4af6-b12f-3110b93dcab9)
Started: October 10, 2023, 3:01 PM (UTC-04:00) — Finished: October 10, 2023, 3:02 PM (UTC-04:00)
1 errors 0 added 0 updated 0 deleted 0 not modified
Perhaps there's a script or config somewhere that's restarting jobs around the top of the hour if the previous job reported any errors? Just a guess. What do you think @benjwadams?
Harvest jobs are still running more frequently than daily, regardless of what is configured in the CKAN UI.
We tested again with the GCOOS ERDDAP Biological WAF set to Manual and confirmed the setting is propagated to the database; however, routine harvest jobs are still running.
@benjwadams to look into further troubleshooting.
@benjwadams I believe this issue is still present as of our Catalog meeting today, if you're able to look into it now that funding is once again available.
@benjwadams Once #238 is resolved, can you re-investigate this one? Harvest job UI settings don't seem to be persisted or respected by the CKAN harvesters.
This is likely due to `ckan harvester job-all` being issued by the stuck-job cleanup script:
https://github.com/ioos/catalog-docker-base/blob/main/contrib/scripts/clear_stuck_harvests.bash#L9
I'm testing removing just this line and think this will fix things.
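For reference, the change being tested is just dropping that one line. A sketch of doing it with `sed` against a local copy (the file contents below are a stand-in, not the real script; the real one is at the link above):

```shell
# Stand-in copy of the cleanup script (the real one lives at
# contrib/scripts/clear_stuck_harvests.bash in ioos/catalog-docker-base).
cat > clear_stuck_harvests.bash <<'EOF'
#!/bin/bash
# ...stuck-job cleanup steps elided...
ckan harvester job-all
EOF

# Delete only the `job-all` line, so the script still clears stuck jobs
# but no longer re-queues every harvest source after each cleanup.
sed -i '/ckan harvester job-all/d' clear_stuck_harvests.bash
```

With that line gone, a cleanup pass should no longer create new jobs for sources that are set to Manual.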
@benjwadams mentioned during today's meeting that:
Possibly related to the cleanup script manually restarting harvest jobs on failure, not honoring the Manual harvest config flag in settings.
@benjwadams stated in https://github.com/ioos/ckanext-ioos-theme/issues/238#issuecomment-2007755461 that CKAN is honoring the harvest job configurations and this issue can be closed.
If I attempt to set a harvest source to manual, for example:
https://data.ioos.us/harvest/edit/gcoos-erddap-biological
and click Save, the setting appears to be updated in the CKAN UI.
However, there are still multiple (sometimes up to 3) harvests that run for this GCOOS harvest source per day according to the job logs:
https://data.ioos.us/harvest/gcoos-erddap-biological/job
If a harvest source is set to 'Manual', it should only run when manually triggered.
Similarly, if it's set to 'Daily', it should only run once per day (not 2 or 3 times per day).
We need to figure out why the harvest jobs are not running according to the CKAN settings, and fix them so that the harvest frequency is actually controlled by the UI.
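In the meantime, the discrepancy is easy to flag automatically from the job logs. A minimal sketch (not the actual monitoring script; the timestamps below are the hourly gcoos-waf-historical job starts quoted earlier, converted to epoch seconds for easy arithmetic):

```shell
#!/usr/bin/env bash
# Flag consecutive harvest-job starts that are closer together than the
# configured frequency allows. Timestamps are Unix epoch seconds; a
# 'Daily' source should never have two starts less than 86400 s apart.
min_interval=86400  # Daily

# Job start times from the logs above: Oct 9 10:01 PM through
# Oct 10 1:01 AM (UTC-04:00), one hour apart.
starts=(1696903260 1696906860 1696910460 1696914060)

violations=0
for ((i = 1; i < ${#starts[@]}; i++)); do
  gap=$(( starts[i] - starts[i-1] ))
  if (( gap < min_interval )); then
    violations=$(( violations + 1 ))
  fi
done
echo "$violations"
```

Run against a Daily source's real job list, any nonzero count means the schedule is not being honored; for a Manual source, any automatic start at all is a violation.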