anjackson opened this issue 7 years ago
The GDELT Project (by Kalev Leetaru) offers a rich data feed under very open terms.
There is a feed file, updated every 15 minutes, that links to a TSV inside a ZIP file which can be parsed and interpreted to generate a list of URLs with associated metadata (including the lat/lon of each event).
A task could grab this file, parse it, and push the results into an appropriate crawl stream, if they appear to be in scope. If done with care, this could include some of the additional metadata and pass it along to the indexer.
This kind of thing can be done once we have an easily accessible queue/stream to drop candidates into.
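As a rough sketch of the parse step, the extraction from the TSV could look something like the following. The column offsets here (`ACTIONGEO_LAT`, `ACTIONGEO_LONG`, `SOURCEURL`) are assumptions based on the GDELT 2.0 event table layout and should be verified against the current GDELT documentation before use:

```python
import csv
import io

# Assumed column offsets in the GDELT 2.0 event table (61 tab-separated
# columns, SOURCEURL last) -- verify against the GDELT docs before relying
# on these.
ACTIONGEO_LAT = 56
ACTIONGEO_LONG = 57
SOURCEURL = 60


def extract_candidates(tsv_text):
    """Yield (url, lat, lon) tuples from a GDELT-style TSV export.

    Rows that are too short or lack a plausible URL are skipped, so a
    partially corrupt export does not abort the whole run.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) <= SOURCEURL:
            continue  # malformed / truncated row
        url = row[SOURCEURL].strip()
        if not url.startswith("http"):
            continue  # no usable source URL for this event
        lat = row[ACTIONGEO_LAT] or None
        lon = row[ACTIONGEO_LONG] or None
        yield url, lat, lon
```

A task could then wrap this with a fetch of the 15-minute feed file, download and unzip the linked TSV, and push each `(url, lat, lon)` candidate onto the crawl queue/stream once one is available.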