Open anjackson opened 10 years ago
Delaying until a later release for now.
Proposal is to index all request records, and store the status code (see #81). Indexing of request records is deferred until we have a setup for looking up records that 'point to' a given record (#84).
The update-document strategy conflicts with the way we build the index (assembly line and optimize, write once). The same goes for 'revisit' handling. I will try to see what I can come up with to support both strategies.
To avoid any misunderstanding: we are also moving away from the strategy of using updates to documents in Solr.
Instead, we are planning to set up some kind of large index that will be used during the indexing process, as the Solr document is constructed, to populate it with additional information that is not held in the current WARC record.
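
A minimal sketch of that idea, not the actual implementation: a first pass builds a lookup from each record ID to the records that point to it, and a second pass consults that lookup while each document is constructed, so no document is updated after it has been written. The record fields (`id`, `type`, `url`, `status`, `refers_to`) and both function names are hypothetical, and a plain in-memory dict stands in for whatever large index is eventually used.

```python
from collections import defaultdict


def build_pointer_index(records):
    """First pass: map each target record ID to the records that point to it."""
    index = defaultdict(list)
    for rec in records:
        target = rec.get("refers_to")  # hypothetical field name
        if target is not None:
            index[target].append(rec)
    return index


def build_document(rec, pointer_index):
    """Second pass: build the document once, enriched from the lookup."""
    doc = {"id": rec["id"], "url": rec["url"], "status": rec.get("status")}
    # Information that is not held in the current record itself.
    doc["pointed_to_by"] = [r["id"] for r in pointer_index.get(rec["id"], [])]
    return doc


# Illustrative records only; real WARC records carry far more metadata.
records = [
    {"id": "resp-1", "type": "response", "url": "http://example.org/", "status": 200},
    {"id": "req-1", "type": "request", "url": "http://example.org/", "refers_to": "resp-1"},
    {"id": "rev-1", "type": "revisit", "url": "http://example.org/", "refers_to": "resp-1"},
]

pointer_index = build_pointer_index(records)
docs = [build_document(r, pointer_index) for r in records if r["type"] == "response"]
```

Because each document is complete when it is first written, this stays compatible with a write-once index build.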
A number of potentially useful and interesting aspects of the data are ignored at present, as we only index 200 OK responses, e.g. `warc/request` records to allow the discovery path to be explored.

These are all potentially useful additional sources of information, but they are difficult to integrate into our current setup as there is no payload to hash. A solution would be to use an ID with no hash prefix (just the latter `/md5(URL)` part), but it's not clear how this would meaningfully interact with the other entries concerning that URL.

In particular, how would we meaningfully add the target of a redirect to the index?
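
To make the ID question concrete, here is a rough sketch of the two ID shapes. It assumes the current ID has the form `<payload hash>/md5(URL)`. The choice of SHA-1 for the payload hash and the helper names `url_key` and `record_id` are illustrative assumptions, not taken from the codebase.

```python
import hashlib


def url_key(url):
    """md5(URL) as a hex string: the part of the ID shared by all entries for a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def record_id(url, payload=None):
    """Build an ID of the assumed form '<payload hash>/<md5(URL)>'.

    When there is no payload to hash (e.g. a request record), fall back
    to the bare '/<md5(URL)>' form. SHA-1 is an assumption for illustration.
    """
    suffix = "/" + url_key(url)
    if payload:
        return hashlib.sha1(payload).hexdigest() + suffix
    return suffix


url = "http://example.org/"
with_payload = record_id(url, b"<html>hello</html>")
without_payload = record_id(url)

# Both IDs end in the same '/md5(URL)' suffix.
shared_suffix = with_payload.endswith(without_payload)
```

The shared suffix only shows that the entries for one URL could be found together; it does not settle how they should interact, which remains the open question here.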