ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Add task to pull URLs and metadata in from GDELT #22

Open anjackson opened 7 years ago

anjackson commented 7 years ago

The GDELT Project (by Kalev Leetaru) offers a rich data feed under very open terms.

There is a feed file, updated every 15 minutes, that links to a TSV in a ZIP file. This can be parsed and interpreted to generate a list of URLs with associated metadata (including the lat/lon of each event).

A task could grab this file, parse it, and push the URLs into an appropriate crawl stream if they appear to be in scope. If done with care, this could also capture some of the additional metadata and pass it along to the indexer.
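A minimal sketch of such a task, using only the Python standard library. The feed URL is GDELT's published `lastupdate.txt` endpoint; the column positions for `SOURCEURL` and `ActionGeo_Lat`/`ActionGeo_Long` are assumptions based on the GDELT 2.0 event codebook and should be verified before use:

```python
import csv
import io
import zipfile
from urllib.request import urlopen

LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

# Assumed column positions in the GDELT 2.0 event export (61 fields);
# check against the GDELT codebook before relying on these.
ACTIONGEO_LAT = 56
ACTIONGEO_LONG = 57
SOURCEURL = 60


def parse_event_row(fields):
    """Pull the source URL and event lat/lon out of one TSV row.

    Returns a dict, or None if the row has no usable URL.
    """
    if len(fields) <= SOURCEURL or not fields[SOURCEURL].startswith("http"):
        return None
    return {
        "url": fields[SOURCEURL],
        "lat": fields[ACTIONGEO_LAT] or None,
        "lon": fields[ACTIONGEO_LONG] or None,
    }


def fetch_candidates():
    """Read the 15-minute feed file, download the latest export ZIP,
    and yield one {url, lat, lon} dict per event row."""
    with urlopen(LASTUPDATE_URL) as f:
        # Each feed line is "<size> <md5> <url>"; the first line
        # points at the latest event export ZIP.
        export_url = f.readline().decode("utf-8").split()[-1]
    with urlopen(export_url) as f:
        payload = io.BytesIO(f.read())
    with zipfile.ZipFile(payload) as zf:
        # The ZIP holds a single tab-separated .CSV member.
        with zf.open(zf.namelist()[0]) as member:
            text = io.TextIOWrapper(member, encoding="utf-8", errors="replace")
            for fields in csv.reader(text, delimiter="\t"):
                record = parse_event_row(fields)
                if record:
                    yield record
```

The parsing is kept separate from the fetching so the row handling can be tested offline; scope checking and de-duplication against previously seen URLs would sit on top of `fetch_candidates()`.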

anjackson commented 5 years ago

This kind of thing can be done once we have an easily accessible queue/stream to drop candidates into.