ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Harvest jobs not configurable via UI #247

Closed: mwengren closed this issue 2 months ago

mwengren commented 8 months ago

If I attempt to set a harvest source to manual, for example:

https://data.ioos.us/harvest/edit/gcoos-erddap-biological

and click Save, the setting appears to be updated in the CKAN UI.

However, there are still multiple (sometimes up to 3) harvests that run for this GCOOS harvest source per day according to the job logs:

https://data.ioos.us/harvest/gcoos-erddap-biological/job

If a harvest source is set to 'Manual', it should only run when manually triggered.

Similarly, if it's set to 'Daily', it should only run 1x per day (not 2 or 3x/day).

We need to figure out why the harvest jobs are not running according to the CKAN settings, and fix this so that harvest frequency is actually configurable via the UI.
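As a rough way to check the expectation above against a job log, here is a minimal sketch. It assumes the job start times have already been collected as `datetime` objects; the function name, the `MAX_RUNS_PER_DAY` table, and the sample timestamps are hypothetical and not part of CKAN or ckanext-harvest.

```python
from collections import Counter
from datetime import datetime

# Assumed cap on runs per calendar day for each UI setting (illustrative only).
# For "MANUAL" the cap is 0, so any manually triggered run is also reported
# and has to be discounted by hand.
MAX_RUNS_PER_DAY = {"MANUAL": 0, "DAILY": 1}


def days_exceeding_frequency(job_starts, frequency):
    """Return {date: run_count} for days with more runs than the setting allows."""
    limit = MAX_RUNS_PER_DAY[frequency.upper()]
    runs_per_day = Counter(start.date() for start in job_starts)
    return {day: count for day, count in runs_per_day.items() if count > limit}


if __name__ == "__main__":
    # Hypothetical start times, not taken from a real job log.
    starts = [
        datetime(2023, 10, 9, 22, 1),
        datetime(2023, 10, 9, 23, 1),
        datetime(2023, 10, 10, 0, 1),
    ]
    for day, count in days_exceeding_frequency(starts, "Daily").items():
        print(f"{day}: {count} runs")
```

Any day this reports for a 'Daily' source is a day on which the source ran more often than configured.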

mwengren commented 8 months ago

xref harvest monitoring script: https://github.com/ioos/catalog-docker-base/blob/master/contrib/scripts/clear_stuck_harvests.bash

and https://github.com/ioos/catalog-docker-base/blob/master/contrib/scripts/clear-harvests.sh

mwengren commented 8 months ago

@benjwadams is looking into this in relation to the above monitoring scripts and troubleshooting why jobs appear to be restarting more frequently than they should.

During today's meeting, we looked at the GCOOS Biological ERDDAP WAF harvest source, which appeared to be running roughly hourly whenever the harvest job errored out quickly due to a read timeout from GCOOS' server (HTTPS, port 443):

Unable to get content for URL: https://gcoos5.geos.tamu.edu/erddap/metadata/iso19115/xml/: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='gcoos5.geos.tamu.edu', port=443): Read timed out. (read timeout=60)"))

[Job: 9821d492-1605-4897-b3d0-de004d27d7f5](https://data.ioos.us/harvest/gcoos-waf-historical/job/9821d492-1605-4897-b3d0-de004d27d7f5)
Started: October 10, 2023, 1:01 AM (UTC-04:00) — Finished: October 10, 2023, 1:02 AM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

[Job: 47fb33cf-a35c-420f-a1f9-9ff93ebbacae](https://data.ioos.us/harvest/gcoos-waf-historical/job/47fb33cf-a35c-420f-a1f9-9ff93ebbacae)
Started: October 10, 2023, 12:01 AM (UTC-04:00) — Finished: October 10, 2023, 12:02 AM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

[Job: 505f9a44-7a72-4917-89f2-2f4464371db4](https://data.ioos.us/harvest/gcoos-waf-historical/job/505f9a44-7a72-4917-89f2-2f4464371db4)
Started: October 9, 2023, 11:01 PM (UTC-04:00) — Finished: October 9, 2023, 11:02 PM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

[Job: 9dd14b41-6df1-476a-9f5e-2c252ab1dfee](https://data.ioos.us/harvest/gcoos-waf-historical/job/9dd14b41-6df1-476a-9f5e-2c252ab1dfee)
Started: October 9, 2023, 10:01 PM (UTC-04:00) — Finished: October 9, 2023, 10:02 PM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

Then, a successful harvest would run that was able to get a valid response from GCOOS' server, and take about 13 hours to complete (with a number of errors that are to be expected and still constitute a 'successful' harvest):

[Job: 94ecc350-c6d0-40da-8dbb-d1fbb34428fd](https://data.ioos.us/harvest/gcoos-waf-historical/job/94ecc350-c6d0-40da-8dbb-d1fbb34428fd)
Started: October 10, 2023, 1:45 AM (UTC-04:00) — Finished: October 10, 2023, 2:44 PM (UTC-04:00)

2657 errors 0 added 2597 updated 0 deleted 0 not modified

Then, it would go back to the roughly hourly job execution pattern:

[Job: f7b50b2f-9f72-4b8e-8a09-d722504c70f4](https://data.ioos.us/harvest/gcoos-waf-historical/job/f7b50b2f-9f72-4b8e-8a09-d722504c70f4)
Started: October 10, 2023, 5:01 PM (UTC-04:00) — Finished: October 11, 2023, 5:12 AM (UTC-04:00)

1260 errors 0 added 1226 updated 0 deleted 0 not modified

[Job: 1562a12a-9d51-49a9-b605-7d9866dc7cd6](https://data.ioos.us/harvest/gcoos-waf-historical/job/1562a12a-9d51-49a9-b605-7d9866dc7cd6)
Started: October 10, 2023, 4:01 PM (UTC-04:00) — Finished: October 10, 2023, 4:02 PM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

[Job: 4c553d1f-b6a4-4af6-b12f-3110b93dcab9](https://data.ioos.us/harvest/gcoos-waf-historical/job/4c553d1f-b6a4-4af6-b12f-3110b93dcab9)
Started: October 10, 2023, 3:01 PM (UTC-04:00) — Finished: October 10, 2023, 3:02 PM (UTC-04:00)

1 errors 0 added 0 updated 0 deleted 0 not modified

Perhaps there's a script or config somewhere that's restarting jobs around the top of the hour if the previous job reported any errors? Just a guess. What do you think @benjwadams?
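To test the top-of-the-hour guess against more of the job history, the "Started ... Finished ..." lines shown in the job listings above could be parsed with something like the sketch below. It assumes the line format stays exactly as pasted here and that both timestamps on a line share the same UTC offset (so the offset can be ignored when subtracting). The function names and the 5-minute threshold are hypothetical choices.

```python
import re
import sys
from datetime import datetime

# Matches "Started: <time> (UTC...) <separator> Finished: <time> (UTC...)".
# The separator between the two halves can be any single non-space token.
LINE_RE = re.compile(
    r"Started: (.+?) \(UTC[^)]*\)\s+\S+\s+Finished: (.+?) \(UTC[^)]*\)"
)
TIME_FORMAT = "%B %d, %Y, %I:%M %p"  # e.g. "October 10, 2023, 1:01 AM"


def parse_job_line(line):
    """Return (started, finished) as naive datetimes parsed from one log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a Started/Finished line: {line!r}")
    started, finished = (datetime.strptime(g, TIME_FORMAT) for g in match.groups())
    return started, finished


def summarize(line, top_of_hour_minutes=5):
    """Report duration and whether the job began within the first minutes of an hour."""
    started, finished = parse_job_line(line)
    return {
        "started": started,
        "duration_minutes": int((finished - started).total_seconds() // 60),
        "near_top_of_hour": started.minute < top_of_hour_minutes,
    }


if __name__ == "__main__":
    # Pipe pasted job-log text in on stdin; other lines are skipped.
    for raw in sys.stdin:
        if "Started:" in raw:
            print(summarize(raw))
```

If nearly every job reports `near_top_of_hour` as true, that points toward an hourly scheduled script rather than CKAN's own per-source frequency setting.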

mwengren commented 7 months ago

Harvest jobs are still running more often than daily, regardless of what is configured in the CKAN UI.

We tested again with the GCOOS ERDDAP Biological WAF set to Manual and confirmed that this setting is propagated to the database. However, routine harvest jobs are still running for this source.

@benjwadams to look into further troubleshooting.

mwengren commented 5 months ago

@benjwadams I believe this issue is still present as of our Catalog meeting today, if you're able to look into it now that funding is once again available.

mwengren commented 5 months ago

@benjwadams Once #238 is resolved, can you re-investigate this one? Harvest job UI settings don't seem to be persisted or respected by the CKAN harvesters.

benjwadams commented 4 months ago

This is likely due to `ckan harvester job-all` (https://github.com/ioos/catalog-docker-base/blob/main/contrib/scripts/clear_stuck_harvests.bash#L9) being issued by the stuck job cleanup script. I'm testing removing just this line and think this will fix things.
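For illustration only, and not the change being tested here (which simply removes the line): if a cleanup script did need to restart work after clearing stuck jobs, it could restrict itself to the sources it actually cleared and skip any set to manual, instead of queuing a job for every source. The sketch below assumes each harvest source is available as a dict with `id` and `frequency` keys and that a manual source carries the value `"MANUAL"`; both the data shape and the function name are assumptions.

```python
def sources_to_requeue(sources, stuck_source_ids):
    """Return ids of sources a cleanup script could re-queue without overriding the UI.

    sources: iterable of dicts with "id" and "frequency" keys (assumed shape).
    stuck_source_ids: set of source ids whose stuck job was just cleared.
    """
    requeue = []
    for source in sources:
        # A missing frequency is treated as manual, the conservative choice.
        frequency = str(source.get("frequency", "MANUAL")).upper()
        if source["id"] in stuck_source_ids and frequency != "MANUAL":
            requeue.append(source["id"])
    return requeue


if __name__ == "__main__":
    # Hypothetical sources, not real catalog entries.
    example_sources = [
        {"id": "source-a", "frequency": "MANUAL"},
        {"id": "source-b", "frequency": "DAILY"},
        {"id": "source-c", "frequency": "DAILY"},
    ]
    print(sources_to_requeue(example_sources, {"source-a", "source-b"}))
```

The point of the design is that a blanket restart ignores per-source settings, while a filtered restart leaves manual sources and sources that were never stuck alone.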

mwengren commented 4 months ago

@benjwadams mentioned during today's meeting that:

Possibly related to cleanup script manually restarting harvest job on failure. Not honoring the manual harvest config flag in settings.

mwengren commented 2 months ago

@benjwadams stated in https://github.com/ioos/ckanext-ioos-theme/issues/238#issuecomment-2007755461 that CKAN is honoring the harvest job configurations and this issue can be closed.