anjackson opened this issue 9 years ago
So, the first part is that for multiple crawls, we have decided to have separate documents, and use e.g. group-by-URL to bring them together. This means we have static shards, which are much easier to manage over time.
However, we need to go further.

Following on from the `crawl_dates` field introduced in #21, we have a reasonable but limited solution for handling URL 'lifetimes'. If the records are processed in reverse-chronological order, then `crawl_date` ends up holding the first crawl date and all of the crawl dates end up in `crawl_dates`. However, processing content in reverse-chronological order is error-prone and unsustainable, and multiple passes lead to duplicate `crawl_date` entries which may confuse downstream users.

We ideally need to index in a smarter way, so that the first and last crawl dates can be extracted, and so that the overall longevity of a URL can be recorded and made available for faceting. This also links in with the ideas in issue #32, where we might use non-200 responses to make more definitive statements about the 'life story' of a URL.
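For concreteness, here is a minimal sketch (Java, against SolrJ's `SolrInputDocument`) of the kind of single-pass summary we'd like to produce per URL once all of a URL's crawl dates are available in one place. The `crawl_date` and `crawl_dates` fields are the existing ones from #21; `crawl_date_latest` and `crawl_span_days` are purely illustrative names, not part of the current schema:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

public class CrawlDateSummariser {

    /**
     * Given all crawl dates gathered for one URL (however they are collected),
     * derive first/last dates and overall longevity in a single pass, rather
     * than relying on the input arriving in reverse-chronological order.
     */
    public static SolrInputDocument summarise(String url, List<Instant> crawlDates) {
        if (crawlDates.isEmpty()) {
            throw new IllegalArgumentException("No crawl dates for " + url);
        }
        List<Instant> dates = new ArrayList<>(crawlDates);
        Collections.sort(dates);
        Instant first = dates.get(0);
        Instant last = dates.get(dates.size() - 1);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("url", url);
        // Existing fields (see #21): earliest capture plus the full set of captures.
        doc.addField("crawl_date", first.toString());
        for (Instant d : dates) {
            doc.addField("crawl_dates", d.toString());
        }
        // Hypothetical additional fields for 'lifetime' faceting:
        doc.addField("crawl_date_latest", last.toString());
        doc.addField("crawl_span_days", Duration.between(first, last).toDays());
        return doc;
    }
}
```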
This really requires some kind of intermediary database, which might be a simple CDX-style lookup, or perhaps a full-on intermediary like HBase (e.g. warcbase).
Switching to HBase would also mean we could avoid storing fields in Solr (thus reducing index size), and would make ACT annotation updates to the indexes more scalable. But HBase adds significant complexity to the deployment.
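For illustration, a rough sketch of what an HBase-backed intermediary might look like: one row per canonicalised URL, one column per capture timestamp, so that reading a row yields the full capture history from which first/last dates follow. The table and column-family names are invented here; this is not warcbase's actual schema.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCaptureIndex implements AutoCloseable {

    private static final byte[] FAMILY = Bytes.toBytes("captures");
    private final Connection connection;
    private final Table table;

    public HBaseCaptureIndex() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        this.connection = ConnectionFactory.createConnection(conf);
        this.table = connection.getTable(TableName.valueOf("url_captures"));
    }

    /** Record one capture of a URL, e.g. timestamp14 = "20140101120000". */
    public void addCapture(String canonUrl, String timestamp14) throws IOException {
        Put put = new Put(Bytes.toBytes(canonUrl));
        put.addColumn(FAMILY, Bytes.toBytes(timestamp14), Bytes.toBytes(""));
        table.put(put);
    }

    /** Fetch the full capture history for a URL, from which first/last dates follow. */
    public Result getCaptures(String canonUrl) throws IOException {
        return table.get(new Get(Bytes.toBytes(canonUrl)).addFamily(FAMILY));
    }

    @Override
    public void close() throws IOException {
        table.close();
        connection.close();
    }
}
```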
Note that the code currently contains the logic needed to look up URLs in the Solr services and use that to perform the correct de-duplication. This was found to be rather too slow, but could be revisited.
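That lookup is along these lines (SolrJ sketch, with placeholder field names): before indexing a record, query for any existing document with the same URL so that its crawl dates can be handled correctly. It is one query per record, which is why it proved too slow.

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocumentList;

public class SolrUrlLookup {

    /** Look up any existing document for this URL so its dates can be merged. */
    public static SolrDocumentList findExisting(HttpSolrClient solr, String url)
            throws SolrServerException, IOException {
        SolrQuery query = new SolrQuery("url:" + ClientUtils.escapeQueryChars(url));
        query.setFields("id", "crawl_date", "crawl_dates");
        query.setRows(1);
        return solr.query(query).getResults();
    }
}
```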
However, given that all the required information is in the CDX files, it may make more sense to avoid adding further dependencies and use Hadoop/HDFS only. If we can resolve the issues with sorting very large CDX files quickly enough, then we could use a simple CDX-file lookup. Similarly, we could map the CDX file contents into MapFiles, which may be more performant for this use case.
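A rough sketch of the MapFile idea, assuming a plain, sorted CDX where the first field is the canonicalised URL key and the second is the 14-digit timestamp (a real implementation would need to handle CDX header lines and format variants):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CdxToMapFile {

    /** Build a MapFile keyed by URL, with all capture timestamps as the value. */
    public static void build(String cdxFile, String mapFileDir, Configuration conf)
            throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(cdxFile));
             MapFile.Writer writer = new MapFile.Writer(conf, new Path(mapFileDir),
                     MapFile.Writer.keyClass(Text.class),
                     MapFile.Writer.valueClass(Text.class))) {

            String line, currentKey = null;
            StringBuilder timestamps = new StringBuilder();
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(" ");
                if (fields.length < 2 || line.startsWith(" CDX")) {
                    continue; // skip header or malformed lines
                }
                if (currentKey != null && !currentKey.equals(fields[0])) {
                    // CDX is sorted by key, so all captures of a URL are adjacent:
                    writer.append(new Text(currentKey), new Text(timestamps.toString()));
                    timestamps.setLength(0);
                }
                currentKey = fields[0];
                if (timestamps.length() > 0) timestamps.append(' ');
                timestamps.append(fields[1]);
            }
            if (currentKey != null) {
                writer.append(new Text(currentKey), new Text(timestamps.toString()));
            }
        }
    }

    /** Random-access lookup of all capture timestamps for a given URL key. */
    public static String lookup(String mapFileDir, String urlKey, Configuration conf)
            throws IOException {
        try (MapFile.Reader reader = new MapFile.Reader(new Path(mapFileDir), conf)) {
            Text value = new Text();
            return reader.get(new Text(urlKey), value) == null ? null : value.toString();
        }
    }
}
```

At indexing time each record would then be a single key lookup against the MapFile, rather than a Solr query or an HBase round-trip.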