ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Decide how to handle records other than 200 OK responses #32

Open anjackson opened 10 years ago

anjackson commented 10 years ago

A number of potentially useful and interesting aspects of the data are ignored at present, as we only index 200 OK responses. e.g.

These are all potentially useful additional sources of information, but they are difficult to integrate into our current setup because there is no payload to hash. One solution would be to use an ID with no hash prefix (just the trailing /md5(URL) part), but it's not clear how such an entry would meaningfully interact with the other entries for that URL.
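As a rough sketch of the two ID shapes discussed here (not the project's actual code), the following builds a `<payload-hash>/<md5(URL)>` ID when a payload exists and falls back to just `/<md5(URL)>` when it does not. The function name `make_record_id` and the use of a SHA-1 hex digest for the payload hash are assumptions for illustration only.

```python
import hashlib
from typing import Optional


def make_record_id(url: str, payload: Optional[bytes] = None) -> str:
    """Build an ID of the form '<payload-hash>/<md5(url)>'.

    If there is no payload (payload is None), the hash prefix is left
    empty, so the ID is just '/<md5(url)>'.
    """
    url_part = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Assumption: SHA-1 hex stands in for whatever payload hash is really used.
    prefix = hashlib.sha1(payload).hexdigest() if payload is not None else ""
    return f"{prefix}/{url_part}"


with_payload = make_record_id("http://example.com/", b"<html>...</html>")
without_payload = make_record_id("http://example.com/")
```

Both IDs share the same `/md5(URL)` suffix, so in principle entries for one URL could be grouped on that suffix, though that does not by itself settle how they should interact.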

In particular, how would we meaningfully add the target of a redirect to the index?

anjackson commented 10 years ago

Delaying until a later release for now.

anjackson commented 7 years ago

The proposal is to index all request records and store the status code (see #81). Indexing of request records is deferred until we have a setup for looking up records that 'point to' a given record (#84).

thomasegense commented 7 years ago

The update-document strategy conflicts with the way we build the index (assembly line and optimize, write once). The same goes for 'revisit' handling. I will try to see what I can come up with to support both strategies.

anjackson commented 7 years ago

In case there is any misunderstanding, we are also moving away from the strategy of updating documents in Solr.

Instead, we are planning to set up some kind of large index that will be used during the indexing process, as the Solr document is constructed, to populate it with additional information that is not held in the current WARC record.
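A minimal sketch of that idea, under loud assumptions: the lookup here is just an in-memory dict mapping a redirect target to the URLs that redirect to it, and the record layout, field names (`status_code`, `redirected_from`) and helper names are all hypothetical. The point is only that the document is written once, already enriched, rather than updated later.

```python
from typing import Dict, List

# Hypothetical lookup: redirect target URL -> URLs that redirect to it.
RedirectLookup = Dict[str, List[str]]


def build_lookup(records: List[dict]) -> RedirectLookup:
    """First pass: collect redirect sources for each target URL."""
    lookup: RedirectLookup = {}
    for rec in records:
        if 300 <= rec["status"] < 400 and rec.get("location"):
            lookup.setdefault(rec["location"], []).append(rec["url"])
    return lookup


def build_document(rec: dict, lookup: RedirectLookup) -> dict:
    """Second pass: construct the document once, enriched from the lookup."""
    doc = {"url": rec["url"], "status_code": rec["status"]}
    sources = lookup.get(rec["url"])
    if sources:
        # Information that is not held in this record itself.
        doc["redirected_from"] = list(sources)
    return doc


records = [
    {"url": "http://a.example/", "status": 301, "location": "http://b.example/"},
    {"url": "http://b.example/", "status": 200},
]
lookup = build_lookup(records)
docs = [build_document(rec, lookup) for rec in records]
```

Because nothing is modified after it has been written, an approach along these lines should also fit a write-once index build.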