oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Allow replication of an external archive's URI-R captures based on a TimeMap #181

Open machawk1 opened 7 years ago

machawk1 commented 7 years ago

Web archives that support Memento serve an endpoint that provides a TimeMap for a URI-R.

For sites whose captures I am interested in (e.g., my homepage), I may want to retain a local copy of the archives' holdings. A personal instance of ipwb might be suitable for this (albeit pie-in-the-sky at the moment).

  1. From ipwb, allow a user to specify a URI-R and a TimeMap source. The source may be a Memento aggregator or a single Web archive (e.g., Internet Archive).
  2. Fetch the TimeMap.
  3. For each URI-M in the TimeMap, fetch the payload, inclusive of the HTTP headers (and recurse for embedded resources), retaining an association with the URI-M and datetime (though these may be parsed out from the archived HTTP headers).
  4. Extract the necessary information from the response (e.g., Memento-Datetime).
  5. Consume the extracted content using the ipwb indexing procedure, pushing it into IPFS and retaining the content hashes.
  6. Create a CDXJ using the existing ipwb logic, inclusive of the hashes, URI-R, datetime, and any other necessary (e.g., MIME type) or desired (e.g., page title) information.
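The tail of this pipeline (steps 4–6) could be sketched roughly as below. All names here are hypothetical, `surt_key` is a crude stand-in for a real SURT implementation (e.g., the `surt` library), and the `urn:ipfs/<header-hash>/<payload-hash>` locator layout is an assumption modeled on ipwb's CDXJ entries:

```python
import json
from datetime import timezone
from email.utils import parsedate_to_datetime


def memento_datetime_to_ts(header_value: str) -> str:
    """Convert an RFC 1123 Memento-Datetime header into a 14-digit timestamp."""
    dt = parsedate_to_datetime(header_value)
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def surt_key(uri_r: str) -> str:
    """Very rough SURT-style key; real code should use the surt library."""
    host, _, path = uri_r.split("://", 1)[1].partition("/")
    return ",".join(reversed(host.lower().split("."))) + ")/" + path


def cdxj_line(uri_r, memento_datetime, header_hash, payload_hash,
              mime="text/html", status="200", title=None):
    """Assemble one CDXJ entry (hypothetical field layout)."""
    fields = {
        "locator": f"urn:ipfs/{header_hash}/{payload_hash}",  # assumed format
        "mime_type": mime,
        "status_code": status,
    }
    if title:
        fields["title"] = title
    ts = memento_datetime_to_ts(memento_datetime)
    return f"{surt_key(uri_r)} {ts} {json.dumps(fields, sort_keys=True)}"
```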

This implementation would likely need to rely on a separate entity beyond the indexer and replay system, but could tie into the indexer, per above. The replay interface may serve as a more user-friendly means to accomplish this.

Steps 1-3 can be accomplished independently of ipwb and might be useful for other projects.
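As a rough illustration of steps 1-2, a link-format TimeMap can be fetched and its mementos extracted with the stdlib alone. The function names are hypothetical, the endpoint shown is the Internet Archive's link-format TimeMap endpoint (aggregators differ), and the regex assumes the common `rel="…memento…"; datetime="…"` attribute ordering:

```python
import re
from urllib.request import urlopen

# Matches one TimeMap entry whose rel contains "memento" and captures
# its URI-M and datetime; assumes rel precedes datetime, as IA emits.
MEMENTO_RE = re.compile(
    r'<(?P<urim>[^>]+)>;[^,]*?rel="[^"]*memento[^"]*"'
    r'[^,]*?datetime="(?P<dt>[^"]+)"'
)


def fetch_timemap(uri_r, endpoint="http://web.archive.org/web/timemap/link/"):
    """Fetch the link-format TimeMap for a URI-R from an IA-style endpoint."""
    with urlopen(endpoint + uri_r) as resp:
        return resp.read().decode("utf-8")


def parse_timemap(link_format_text):
    """Extract (URI-M, datetime) pairs from a link-format TimeMap."""
    return [(m.group("urim"), m.group("dt"))
            for m in MEMENTO_RE.finditer(link_format_text)]
```

Step 3 would then loop over the pairs, dereferencing each URI-M for its headers and payload.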

ibnesayeed commented 7 years ago

WebRecorder already supports recording captures from another archive and generating WARCs. It even takes care of un-rewriting when necessary. Either that component of WebRecorder can be used to generate WARCs that are then indexed as usual, or we can request the original capture of each URI-M in the TimeMap (using the id_ trick), then push to IPFS and index at the same time, as you proposed. One thing to note, though: captures based on TimeMaps will be atomic to each URI-R, in the sense that we don't expect a complete composite memento to be preserved this way.
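The id_ trick mentioned above rewrites a URI-M so that a pywb/Wayback-style archive returns the unmodified capture. A minimal sketch, assuming the common `<archive>/<14-digit timestamp>/<URI-R>` layout (and, as noted later in this thread, the modifier is not guaranteed to be honored by every archive):

```python
import re


def to_raw_urim(urim: str) -> str:
    """Insert the id_ modifier after the 14-digit timestamp of a URI-M
    so the archive serves the raw (un-rewritten) capture."""
    return re.sub(r"/(\d{14})/", r"/\g<1>id_/", urim, count=1)
```

For example, `.../web/20130101000000/http://example.com/` becomes `.../web/20130101000000id_/http://example.com/`.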

As far as the implementation is concerned, I think we can rethink what the indexer treats as its source. The current implementation handles a WARC as the source, but a live site or another archive could play that role equally well.

machawk1 commented 7 years ago

It would be great to be able to reuse WebRecorder code to accomplish this (insight as to where in the code this is performed, @ikreymer?), but I believe we do not need to convert the header and payload to the WARC format to accomplish this (although it would allow us to reuse some of our existing code). The id_ trick is hacky and may be inconsistent across archives. This feature may also be a potential use case for https://github.com/webrecorder/public-web-archives , which we could consult for determining the path to get the un-rewritten/raw payload.

We have previously discussed isolating the features from the indexer on a per-task basis, as the indexer currently does more than indexing (e.g., pushes to IPFS).

ibnesayeed commented 7 years ago

> We have previously discussed isolating the features from the indexer on a per-task basis, as the indexer currently does more than indexing (e.g., pushes to IPFS).

This decoupling won't be easy, as the index requires the IPFS hashes that are returned only after pushing content into IPFS. So even if we put that logic in a separate file, the system would need to run in pipe mode, which won't be any better. However, the source-reading logic (currently only available for WARCs) can be extracted out and fed into the indexer via dependency injection or automatic detection. The indexer would then require a uniform API from any source driver we have. One possibility would be to use the warcio library to read WARCs as a source and implement the subset of its API (only the parts necessary for indexing and pushing to IPFS) for other sources.
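The source-driver idea could look something like the sketch below. All names are hypothetical; a warcio-backed driver would implement the same interface for WARC files, and a remote-archive driver would populate it from TimeMap-derived captures:

```python
from abc import ABC, abstractmethod
from typing import Iterator, NamedTuple


class CaptureRecord(NamedTuple):
    uri_r: str
    datetime14: str      # 14-digit capture timestamp
    http_headers: bytes  # raw header block
    payload: bytes       # raw entity body


class SourceDriver(ABC):
    """Uniform API the indexer depends on, regardless of source."""

    @abstractmethod
    def records(self) -> Iterator[CaptureRecord]:
        ...


class InMemorySource(SourceDriver):
    """Toy driver standing in for a WARC- or archive-backed one."""

    def __init__(self, captures):
        self._captures = captures

    def records(self):
        yield from self._captures


def index(source: SourceDriver):
    """Indexer skeleton: the source is injected, not hard-coded to WARCs."""
    for rec in source.records():
        # ...push rec.http_headers and rec.payload to IPFS, emit CDXJ...
        yield rec.uri_r, rec.datetime14
```

The indexer stays oblivious to where captures come from; only the driver changes.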