ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Decide how to handle multiple crawls and URL 'lifetime' #39

Open anjackson opened 9 years ago

anjackson commented 9 years ago

Following on from the crawl_dates field introduced in #21, we have a reasonable but limited solution to handling URL 'lifetimes'. If the records are processed in reverse-chronological order, then crawl_date ends up holding the first (earliest) crawl date and all of the crawl dates are captured in crawl_dates. However, processing content in reverse-chronological order is error-prone and unsustainable, and multiple passes produce duplicate entries in crawl_dates, which may confuse downstream users.
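
For context, here is a minimal SolrJ sketch of the kind of atomic update that appends a capture timestamp to the multi-valued crawl_dates field (the core URL, document id and update style are illustrative assumptions, not necessarily how the indexer currently does it). Re-running the same pass simply appends the same value again, which is where the duplicate entries come from:

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CrawlDateUpdate {

    // Appends one capture timestamp to the multi-valued crawl_dates field of an
    // existing document, using Solr's atomic-update syntax ({"add": value}).
    public static void addCrawlDate(SolrClient solr, String docId, String captureDate)
            throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId);
        doc.addField("crawl_dates", Collections.singletonMap("add", captureDate));
        solr.add(doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core URL and document id, for illustration only.
        try (HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/discovery").build()) {
            addCrawlDate(solr, "example-doc-id", "2014-01-01T12:00:00Z");
            solr.commit();
        }
    }
}
```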

Ideally, we need to index in a smarter way, so that the first and last crawl dates can be extracted, and indeed so that the overall longevity of a URL can be recorded and made available for faceting. This also links in with the ideas in issue #32, where we might use non-200 responses to make more definitive statements about the 'life story' of a URL.
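
As a sketch of the order-independent summary we are after: the first and last crawl dates and a longevity value can be derived from whatever set of capture timestamps is available, regardless of processing order (the field names in the output are illustrative, not the current schema):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

public class UrlLifetime {

    // Derives first/last crawl dates and a longevity value (in days) from a set
    // of capture timestamps, independent of the order they were processed in.
    public static void summarise(Collection<Instant> captures) {
        TreeSet<Instant> sorted = new TreeSet<>(captures);
        Instant first = sorted.first();
        Instant last = sorted.last();
        long longevityDays = Duration.between(first, last).toDays();
        System.out.printf("first_crawl_date=%s last_crawl_date=%s longevity_days=%d%n",
                first, last, longevityDays);
    }

    public static void main(String[] args) {
        summarise(Arrays.asList(
                Instant.parse("2010-05-01T00:00:00Z"),
                Instant.parse("2013-11-20T12:00:00Z"),
                Instant.parse("2012-03-15T08:30:00Z")));
    }
}
```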

This really requires some kind of intermediary database, which might be a simple CDX-style lookup, or perhaps a full-blown intermediary like HBase (e.g. warcbase).

Switching to HBase would also mean we could avoid storing fields in Solr (thus reducing index size), and would make ACT annotation updates to the indexes more scalable. But HBase adds significant complexity to the deployment.

Note that the code currently contains the logic needed to look up URLs in the Solr services and use the results to handle duplication correctly. This was found to be rather too slow, but could be revisited.
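
For reference, a rough sketch of that per-record lookup, assuming SolrJ and url/hash field names (the actual fields and query construction in the indexer may differ); issuing one query like this per WARC record is what makes the approach slow:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

public class ExistingCaptureLookup {

    // Checks whether a capture of this URL with the same payload hash is already
    // in the index, so the indexer could merge crawl dates rather than adding a
    // duplicate document.
    public static boolean alreadyIndexed(SolrClient solr, String url, String hash)
            throws Exception {
        SolrQuery q = new SolrQuery("url:" + ClientUtils.escapeQueryChars(url)
                + " AND hash:" + ClientUtils.escapeQueryChars(hash));
        q.setRows(1);
        return solr.query(q).getResults().getNumFound() > 0;
    }
}
```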

However, given that all the required information is in the CDX files, it may make more sense to avoid adding further dependencies and use Hadoop/HDFS only. If we can resolve the problems with sorting very large CDX files quickly, then we could use a simple CDX-file lookup. Similarly, we could map the CDX file contents into MapFiles, which may be more performant for this use case.
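
A sketch of the MapFile option, assuming the Hadoop 2.x MapFile API and an illustrative "canonicalised URL + timestamp" key scheme; the point is that per-URL lookups become index-assisted seeks rather than scans of a flat sorted CDX file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CdxMapFile {

    // Writes CDX entries into a MapFile keyed by "canonicalised-url timestamp".
    // Keys must be appended in sorted order, i.e. from an already-sorted CDX.
    public static void write(Configuration conf, String dir, Iterable<String[]> cdxEntries)
            throws Exception {
        try (MapFile.Writer writer = new MapFile.Writer(conf, new Path(dir),
                MapFile.Writer.keyClass(Text.class),
                MapFile.Writer.valueClass(Text.class))) {
            for (String[] entry : cdxEntries) {
                // entry[0] = canonicalised URL, entry[1] = 14-digit timestamp,
                // entry[2] = remainder of the CDX line.
                writer.append(new Text(entry[0] + " " + entry[1]), new Text(entry[2]));
            }
        }
    }

    // Looks up the CDX record for a given "url timestamp" key, if present.
    public static String lookup(Configuration conf, String dir, String key)
            throws Exception {
        try (MapFile.Reader reader = new MapFile.Reader(new Path(dir), conf)) {
            Text value = new Text();
            return reader.get(new Text(key), value) != null ? value.toString() : null;
        }
    }
}
```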

anjackson commented 9 years ago

So, the first decision is that for multiple crawls we will keep separate documents, and use e.g. group-by-URL at query time to bring them together. This means we have static shards, which are much easier to manage over time.
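
For illustration, a query-time sketch of that grouping, using Solr's standard result-grouping parameters (the url field name is an assumption about the schema):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupByUrl {

    // Collapses the one-document-per-capture model back to one result per URL at
    // query time, using Solr result grouping on the url field.
    public static QueryResponse search(SolrClient solr, String userQuery) throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("group", true);
        q.set("group.field", "url");
        q.set("group.limit", 10);      // captures returned per URL group
        q.set("group.ngroups", true);  // also report the number of distinct URLs
        return solr.query(q);
    }
}
```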

However, we need to: