ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Improve Document Harvester tools #84

Closed by anjackson 2 years ago

anjackson commented 2 years ago

The core document harvester works as before, via the docharv process command, but some improvements would make it easier to use.

Some example documents:

2021-12-01 14:09:08,352: INFO - lib.docharvester.to_w3act - Sending doc: {'document_url': 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1034406/test-and-trace-week-76.pdf', 'title': 'Weekly statistics for NHS Test and Trace (England): 4 to 10 November 2021', 'filename': 'test-and-trace-week-76.pdf', 'landing_page_url': 'https://www.gov.uk/government/publications/weekly-statistics-for-nhs-test-and-trace-england-4-to-10-november-2021', 'source': 'tid:114728:https://www.gov.uk/government/collections/slides-and-datasets-to-accompany-coronavirus-press-conferences', 'wayback_timestamp': '20211119001232', 'launch_id': None, 'job_name': 'frequent-npld', 'size': 1433224, 'target_id': 147227, 'status': 'ACCEPTED', 'publication_date': '2021-11-18T15:00:04.000+00:00', 'publishers': ['UK Health Security Agency'], 'publisher': 'UK Health Security Agency'}
2021-12-01 14:10:11,009: ERROR - lib.docharvester.to_w3act - The document has been REJECTED! : {'document_url': 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/588647/Smart_Systems_Forum_expression_of_interest_letter.pdf', 'title': '', 'filename': 'Smart_Systems_Forum_expression_of_interest_letter.pdf', 'landing_page_url': 'https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/588647/Smart_Systems_Forum_expression_of_interest_letter.pdf', 'source': 'tid:46004:https://energyinst.org/', 'wayback_timestamp': '20211119081406', 'launch_id': None, 'job_name': 'frequent-npld', 'size': 64377, 'target_id': None, 'status': 'REJECTED', 'api_call_failed': "Could not find rel['up'] relationship.", 'match_failed': True}
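
When reviewing output like the above, it can help to pull the document record back out of each log line. The following is a minimal sketch, assuming every record is logged as a Python dict literal at the end of the line, as in the two examples. The function name `parse_doc_from_log_line` is hypothetical and is not part of ukwa-manage.

```python
import ast


def parse_doc_from_log_line(line):
    """Return the document dict embedded in a log line, or None if there is none.

    Assumes the record is a Python dict literal that runs from the first '{'
    to the end of the line (an assumption based on the examples in this thread).
    """
    start = line.find("{")
    if start == -1:
        return None
    # literal_eval only accepts plain literals (str, int, list, None, True...),
    # so it is safer than eval() for untrusted log content.
    return ast.literal_eval(line[start:].strip())


if __name__ == "__main__":
    # Abbreviated version of the REJECTED line above, shortened for illustration.
    sample = (
        "2021-12-01 14:10:11,009: ERROR - lib.docharvester.to_w3act - "
        "The document has been REJECTED! : "
        "{'filename': 'Smart_Systems_Forum_expression_of_interest_letter.pdf', "
        "'target_id': None, 'status': 'REJECTED', 'match_failed': True}"
    )
    doc = parse_doc_from_log_line(sample)
    print(doc["status"], doc["filename"])
```

Filtering on the `status` or `match_failed` fields of the parsed dict then gives a quick list of documents that need attention.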
anjackson commented 2 years ago

The DOCUMENTS_FOUND_DB_URI environment variable can be used to change the location of the database.

anjackson commented 2 years ago

The CLI has also been modified so that the DB URI can be set from there.
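
To illustrate how a CLI option and the DOCUMENTS_FOUND_DB_URI environment variable might combine, here is a minimal sketch. Only the environment variable name comes from this thread. The `--db-uri` option name, the fallback URI, and the precedence order (CLI first, then environment, then default) are assumptions for illustration, not the actual ukwa-manage behaviour.

```python
import argparse
import os

# Hypothetical fallback; the real default location is not stated in this thread.
DEFAULT_DB_URI = "sqlite:///documents_found.db"


def resolve_db_uri(cli_value=None, environ=None):
    """Pick the DB URI: CLI value first, then the environment, then the default.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    return environ.get("DOCUMENTS_FOUND_DB_URI", DEFAULT_DB_URI)


def build_parser():
    """Build a parser with a hypothetical --db-uri option."""
    parser = argparse.ArgumentParser(description="Document harvester (sketch)")
    parser.add_argument(
        "--db-uri",
        default=None,
        help="Database URI; overrides DOCUMENTS_FOUND_DB_URI if given",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_db_uri(args.db_uri))
```

Leaving the option's default as None keeps "not given on the command line" distinguishable from an explicit value, so the environment variable still applies when the option is omitted.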

anjackson commented 2 years ago

Okay, the new Airflow version addresses these issues.