ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Document harvester getting stuck when connections hang #104

Closed anjackson closed 1 year ago

anjackson commented 1 year ago

The document harvester seems to hang when making connections to the web, which locks up its jobs in a way that Airflow does not handle well.

Suggest adding timeouts to the metadata fetching code.

anjackson commented 1 year ago

Seems I missed a timeout on HEAD requests:

https://github.com/ukwa/ukwa-manage/blob/00f84867cb1f55d7ff398894f8a838cfa95ce45b/lib/docharvester/document_mdex.py#L206
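For illustration only, here is a minimal sketch of a HEAD request that cannot block indefinitely. The linked code is not reproduced in this thread, so this is not the actual fix: the helper name `fetch_head_metadata` and the 10-second default are assumptions, and the sketch uses the standard library's `urllib` rather than whichever HTTP client the harvester uses.

```python
import http.client
import urllib.request
from typing import Optional


def fetch_head_metadata(url: str, timeout: float = 10.0) -> Optional[dict]:
    """Hypothetical helper: issue a HEAD request that cannot block forever.

    Returns the response headers as a dict, or None if the request fails
    or a socket operation does not complete within `timeout` seconds.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        # The timeout applies to each blocking socket operation (connect, read),
        # so a server that accepts the connection but never replies is abandoned.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return dict(response.headers.items())
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Returning `None` on failure lets the caller skip the optional metadata and carry on, so a single unresponsive host does not stall the whole job. Note that the timeout bounds each socket operation rather than the total wall-clock time of the request.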

anjackson commented 1 year ago

Tagging this as 2.3.2.

anjackson commented 1 year ago

Rolled that out. Looking good but will review later before calling this done.

anjackson commented 1 year ago

Okay, looks good! Will reopen if necessary.