Store index in IPFS for full IPFS-based web archive system

ikreymer commented 7 years ago

It's great to see lots of progress happening on ipwb! However, to me it seems that the key aspect still missing is the ability to store and augment the index (CDXJ) in IPFS as well for a full IPFS system. The user should be able to:

Store and retrieve the index from IPFS for a particular collection.
Augment a collection by adding to the index, e.g. by indexing another WARC and merging the result with the previous index.
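The "augment" step above can be sketched as a merge of two sorted CDXJ files. This is only an illustration under stated assumptions: each record is one line of the form `SURT-key timestamp {json}`, header lines start with `!`, and the `locator` values below are made-up placeholders. The function names are hypothetical and are not part of ipwb.

```python
import heapq


def records(lines):
    """Return the sorted, non-empty, non-header lines of a CDXJ index."""
    # Assumption: header/metadata lines start with "!" (e.g. "!context ...").
    stripped = (line.strip() for line in lines)
    return sorted(line for line in stripped if line and not line.startswith("!"))


def merge_cdxj(old_lines, new_lines):
    """Merge two CDXJ indexes into one sorted list without duplicate records."""
    # Only the header lines of the existing index are kept.
    header = [line.strip() for line in old_lines if line.strip().startswith("!")]
    merged = []
    for line in heapq.merge(records(old_lines), records(new_lines)):
        if not merged or merged[-1] != line:  # drop exact duplicates
            merged.append(line)
    return header + merged


# Placeholder data; the locator values are not real IPFS hashes.
old_index = [
    '!context ["http://tools.ietf.org/html/rfc7089"]',
    'com,example)/ 20160101000000 {"locator": "urn:ipfs/A/B"}',
]
new_index = [
    'com,example)/ 20150101000000 {"locator": "urn:ipfs/C/D"}',
    'com,example)/ 20160101000000 {"locator": "urn:ipfs/A/B"}',
]
merged_index = merge_cdxj(old_index, new_index)
```

The merged file could then be added to IPFS like any other file, which would give the collection a new hash for each version of the index.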

Unless I'm missing something, the current system requires that the user maintain the index locally on their own file system (it could also be put into another system, such as Redis).

Removing this limitation is a key requirement for being able to run a web archive entirely on IPFS itself. I remember that last year there was discussion of IPLD as a possible solution. It's been a while since I had a chance to look at this, unfortunately, but I wonder if there have been any new developments or insights?

machawk1 commented 7 years ago

@ikreymer - I brain-dumped something related to this in #60. The idea now is to transmit the CDXJ using your mechanism of choice while both nodes are up. It's sub-optimal. I have not investigated IPLD/IPNS recently beyond the spec's site at http://ipld.io/ .

machawk1 commented 7 years ago

Should the CDXJ indexes somehow be self-aware of their IPFS/IPLD hash? Such data could be stored in the initial CDXJ metadata fields and potentially used in the future to update an index. @ibnesayeed
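One caveat: a content-addressed file cannot embed its own hash, because the hash is computed over the file's contents. It can, however, record the hash of the version it was derived from. A minimal sketch of that idea follows, assuming a `!meta` header line and a hypothetical `prev_index` field; plain SHA-256 is used as a stand-in for an IPFS hash.

```python
import hashlib
import json


def digest(text):
    """Stand-in for an IPFS hash: SHA-256 hex digest, NOT a real multihash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_index_version(record_lines, previous_text=None):
    """Build CDXJ text whose header points at the previous index version."""
    # "prev_index" is a hypothetical field name used only for this sketch.
    meta = {"prev_index": digest(previous_text) if previous_text is not None else None}
    lines = ["!meta " + json.dumps(meta, sort_keys=True)] + sorted(record_lines)
    return "\n".join(lines) + "\n"


def read_meta(index_text):
    """Parse the JSON object of the leading "!meta" line."""
    first_line = index_text.splitlines()[0]
    return json.loads(first_line[len("!meta "):])


v1 = new_index_version(["com,example)/ 20160101000000 {}"])
v2 = new_index_version(
    ["com,example)/ 20160101000000 {}", "com,example)/ 20170101000000 {}"],
    previous_text=v1,
)
```

Following `prev_index` from the newest version back gives the update history of an index.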

ibnesayeed commented 7 years ago

I personally don't like the idea of storing the CDXJ as such in IPFS. It can be done for the sake of storage, but not with discovery as the primary goal. However, I would very much like to make the system free from the CDXJ and work fully on IPFS alone. The main function of the CDXJ is to allow lookup/discovery, which can be done differently, such as how Fluidinfo did it. This would require IPFS to support a key-value store or a linked-data-style capability for exploring the node graph.

One possible workflow would be to define a well-known starting point for everything. In the case of Fluidinfo it is /about/{anything}. So, /about/example.com would provide a node that has many tags (relations) pointing to different property nodes related to example.com. One of those property nodes could point to the TimeMap node, which in turn has connected Memento nodes and returns a doubly linked list or blockchain of mementos. The graph is illustrated in the following figure:

[Figure: img_20161209_114525]
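As a rough in-memory sketch of this first graph (not an IPFS data model), the /about node, its TimeMap property and the doubly linked Memento nodes could look as follows. All class and field names are hypothetical, and `content_ref` is a placeholder for whatever would point at the archived content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Memento:
    datetime: str                      # 14-digit capture time
    content_ref: str                   # hypothetical pointer to archived content
    prev: Optional["Memento"] = None   # link to the older neighbour
    next: Optional["Memento"] = None   # link to the newer neighbour


@dataclass(eq=False)
class TimeMap:
    mementos: List[Memento] = field(default_factory=list)

    def add(self, memento: Memento) -> None:
        """Append a memento (assumed newer than existing ones) and link it."""
        if self.mementos:
            last = self.mementos[-1]
            last.next = memento
            memento.prev = last
        self.mementos.append(memento)


@dataclass(eq=False)
class AboutNode:
    uri: str
    # Relation name -> property node, e.g. "timemap" -> TimeMap.
    tags: Dict[str, object] = field(default_factory=dict)


graph: Dict[str, AboutNode] = {}


def about(uri: str) -> AboutNode:
    """Return the node at the well-known path /about/{uri}, creating it if needed."""
    path = "/about/" + uri
    if path not in graph:
        graph[path] = AboutNode(uri=uri, tags={"timemap": TimeMap()})
    return graph[path]


node = about("example.com")
node.tags["timemap"].add(Memento("20161209114525", "content-ref-1"))
node.tags["timemap"].add(Memento("20161209150323", "content-ref-2"))
```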

An alternative approach would be to offload TimeMap generation to IPNS (which I think is planned/implemented to support the Memento protocol for representing past versions). In this approach, when a memento is resolved without any datetime, IPNS will point to the latest Memento, but it can point to any previous version if a datetime is supplied. Additionally, IPNS should be able to return all versions in the form of a TimeMap. The graph is illustrated in the following figure:

[Figure: img_20161209_150323]
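The resolution rule in this second approach can be sketched independently of IPNS. The sketch assumes 14-digit timestamps (which sort chronologically as strings) and picks the latest version at or before the requested datetime; that selection policy is an assumption made here, since the comment does not specify one.

```python
import bisect


def resolve(timestamps, requested=None):
    """Pick a version from a sorted list of 14-digit timestamps."""
    if not timestamps:
        raise LookupError("no mementos available")
    if requested is None:
        return timestamps[-1]  # no datetime supplied: latest memento
    # Index of the first timestamp strictly greater than the request.
    i = bisect.bisect_right(timestamps, requested)
    # Latest version at or before the request, else fall back to the earliest.
    return timestamps[i - 1] if i > 0 else timestamps[0]


versions = ["20150101000000", "20160101000000", "20170101000000"]
latest = resolve(versions)
mid_2016 = resolve(versions, "20160615000000")
```

Returning the whole `versions` list corresponds to the TimeMap in this picture.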

Each Memento node is further decomposed into request record, response headers, and response payload (if any). Additional attributes can be attached to annotate each Memento even further such as to specify which collection it belongs to, who archived it, the name of the WARC file (if the node was imported from a WARC file) and whatnot. To minimize the number of nodes, the main Memento node itself may store the request record, if request records are going to be unique each time. However, it is important to separate response payload from the response header for greater deduplication.

[Figure: img_20161209_114622]
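Why separating the payload from the headers helps deduplication can be shown with a toy content-addressed store. SHA-256 stands in for the IPFS hash, and the store, function and sample records are all hypothetical.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: identical bytes are stored only once."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data  # same key for same content, so no duplicate
        return key


def store_memento(store, request, response_headers, payload):
    """Store the three parts separately and return a node that links to them."""
    return {
        "request": store.put(request),
        "headers": store.put(response_headers),
        "payload": store.put(payload),
    }


store = BlockStore()
request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
body = b"<html>unchanged page</html>"
# Two captures of the same page: only the Date header differs.
m1 = store_memento(store, request, b"HTTP/1.1 200 OK\r\nDate: Fri, 09 Dec 2016 11:45:25 GMT\r\n", body)
m2 = store_memento(store, request, b"HTTP/1.1 200 OK\r\nDate: Fri, 09 Dec 2016 15:03:23 GMT\r\n", body)
```

Both mementos link to the same payload block. Had headers and payload been stored as one blob, the changing Date header would have produced two full copies of the page.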

This will even allow import and export of WARC files. Export to a WARC file can be selective if the user wants to apply some filters. It is important to note that if such an archiving system is created, WARC-based archiving is not required: independent tools like browser extensions can push observation records (Mementos) into the system with added attributes such as session information and make them globally accessible. This way the archiving system will be fully decentralized and collaborative in both capture and replay.

/cc @jbenet for thoughts.

jbenet commented 7 years ago

This discussion is great -- i need to page in a lot of design considerations here to better understand possibilities & suggest.

In order to get answers faster, and perhaps voice any considerations / feature requests, it may be a good idea to schedule a real-time discussion in one of the Monday sprints.

cc @flyingzumwalt -- you may be a better person than me to help in this discussion.

cc @nicola @diasdavid @whyrusleeping we should have nice importers for WARCs to IPLD to ensure the deduplication is maximized in use cases like these. This should not take a lot of work, even now. Just requires the kick off of the "importers project".

ibnesayeed commented 7 years ago

@jbenet, how mature is IPLD and where can we read more about it? I had a feeling that it was still in the design phase, perhaps because I read about it a long time ago. Also, how far has IPNS reached in supporting Memento?

ikreymer commented 7 years ago

@ibnesayeed I agree that the index should be a native IPFS structure, rather than CDXJ.

Thanks for adding the diagrams, I think all those approaches are good.

I think these link relationships make sense, especially the second option focusing on the TimeGate which links to several Mementos.

Another key part is a list of TimeGates, which would allow building a collection out of multiple URLs. There would also need to be a way to expand this collection and add additional links. Perhaps IPLD could be used here?

It's been a while since I had a chance to look at this more closely, and I don't have much time either, but would be happy to join a call as @jbenet suggested.

I really hope this problem can be figured out once and for all. To me, this is the critical issue that needs to be solved for a true IPFS-based web archiving solution. Not to be overly dramatic, but being able to look up resources by URL and datetime entirely through IPFS is the key difference between just another storage backend to put WARCs into and a ground-breaking, decentralized new web archiving system. I would very much like to see the latter happen :)

whyrusleeping commented 7 years ago

Basic IPLD support has been merged into master, and will be released in 0.4.5 (likely before the end of the year). While the tooling around it so far is a bit sparse (we're still figuring out the best interfaces for this), you can use it and resolve things through it already. For example, take this IPLD object (written in JSON):

{
        "Hello": "World",
        "cats": {
                "kitty": {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
                "moustache": {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
                "bulletcat": {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
        },
        "code": {"/": "QmTG7sMBQXy6niraGSL9mjWvHUHJaMQYKAJ9Stqwmzo992"},
        "catArray": [
                {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
                {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
                {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
        ]
}

You can put this into ipfs like:

> cat foo.json | ipfs dag put
zdpuAxSForPxtBPWPQRfnkMvq5W7ooT9uN9xNutvSqypSmToY

And then view paths over this like:
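As a rough in-memory illustration of what resolving a path over this object means, the Python sketch below walks a path such as cats/kitty key by key. It is plain Python, not the `ipfs dag` tooling and not its output; `resolve_path` is a hypothetical helper, and it stops at a link (`{"/": ...}`) instead of fetching the linked object as IPFS would.

```python
import json

FOO_JSON = """
{
    "Hello": "World",
    "cats": {
        "kitty": {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
        "moustache": {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
        "bulletcat": {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
    },
    "code": {"/": "QmTG7sMBQXy6niraGSL9mjWvHUHJaMQYKAJ9Stqwmzo992"},
    "catArray": [
        {"/": "QmZ2MNo4QesxepKgYiFSaHDDYa9wuETKDRW2pPm7Fu6rsp"},
        {"/": "QmPPD8EeDsXJVoWEHHWBY2QRB8sejcxanVeqCybstpLkMY"},
        {"/": "QmS2rX3vcFJdaZxn88rFoknKcLavxbX4kok7xoFfnirSTT"}
    ]
}
"""


def resolve_path(obj, path):
    """Walk a "/"-separated path through nested dicts and lists."""
    node = obj
    for segment in path.strip("/").split("/"):
        if segment == "":
            continue  # an empty path resolves to the object itself
        if isinstance(node, list):
            node = node[int(segment)]  # list elements are addressed by index
        else:
            node = node[segment]
    return node


foo = json.loads(FOO_JSON)
greeting = resolve_path(foo, "Hello")
kitty_link = resolve_path(foo, "cats/kitty")
second_cat = resolve_path(foo, "catArray/1")
```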

ibnesayeed commented 7 years ago

@ikreymer wrote: "I think these link relationships make sense, especially the second option focusing on the TimeGate which links to several Mementos."

The second one is my personal preference as well. It offloads version resolution to IPNS and does not pollute the object graph itself. The first approach was something I put here just for comments while I was sketching possible architectures on the whiteboard. Its drawback is an unnecessary chain of updates: the TimeMap, the node that keeps pointers to all the mementos, would change after every memento addition, which would leave unwanted dangling versions of the TimeMap object that are not connected from anywhere. Additionally, the reference from /about/{URI} to the TimeMap node would change because its digest would be different. And if we decide to add a reference back from each memento node to the corresponding TimeMap node, then all the mementos would need updating, which would be a disaster.
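The update chain in the first approach follows directly from content addressing: a node's digest covers the digests of the nodes it links to. A small sketch of the effect, with SHA-256 over canonical JSON standing in for the IPFS hash and all node layouts hypothetical:

```python
import hashlib
import json


def node_hash(node):
    """Digest of a node; stand-in for an IPFS hash (SHA-256 of canonical JSON)."""
    return hashlib.sha256(json.dumps(node, sort_keys=True).encode("utf-8")).hexdigest()


m1 = {"datetime": "20161209114525", "payload": "payload-1"}
m2 = {"datetime": "20161209150323", "payload": "payload-2"}

# Version 1: the TimeMap links to one memento, the about node links to the TimeMap.
timemap_v1 = {"mementos": [node_hash(m1)]}
about_v1 = {"uri": "example.com", "timemap": node_hash(timemap_v1)}

# Adding m2 changes the TimeMap, so its digest changes,
# so the about node that links to it must change as well.
timemap_v2 = {"mementos": [node_hash(m1), node_hash(m2)]}
about_v2 = {"uri": "example.com", "timemap": node_hash(timemap_v2)}

timemap_changed = node_hash(timemap_v1) != node_hash(timemap_v2)
about_changed = node_hash(about_v1) != node_hash(about_v2)
```

The memento `m1` itself keeps its digest here only because it holds no link back to the TimeMap; `timemap_v1` remains in the store as the dangling version described above.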

ibnesayeed commented 7 years ago

@ikreymer wrote: "Another key part is a list of TimeGates, which would allow building a collection out of multiple URLs. There would also need to be a way to expand this collection and add additional links. Perhaps IPLD could be used here?"

One way of implementing collection building in the IPFS could be as follows:

Start with the well-known starting point /about (which would be under the namespace of individuals/organizations for authorization purposes) and use a URN to identify the collection, like /about/urn:archive-collections:my%20collection
This collection node would keep a list of references to the about nodes of URI-Rs (illustrated as /about/{URI} in the earlier diagrams), like ["/about/example.com", "/about/cnn.com", "..."]
Keeping the list inside the collection node itself will make sure that the history of the evolving collection seeds is also archived and that old versions of the collection manifest are also accessible via IPNS
Note that we did not use a plain list of URI-Rs to define the collection, like ["example.com", "cnn.com", "..."], because we want collections to be scoped and not to behave as if any memento of a given URI-R is part of the collection no matter who archived it; at the same time, this style would allow collection builders to pick and choose which namespaces (one or more) should be included in the collection
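A minimal sketch of such a collection manifest follows. The path layout `/{namespace}/about/...`, the namespaces `alice` and `bob`, and all function names are assumptions made for illustration; a plain Python list stands in for the version history that IPNS would expose.

```python
from urllib.parse import quote


def collection_path(namespace, name):
    """Build the collection node path under a participant's namespace."""
    return "/{}/about/urn:archive-collections:{}".format(namespace, quote(name))


def new_manifest_version(history, seed_refs):
    """Record a new version of the seed list; older versions stay in history."""
    history.append(sorted(set(seed_refs)))
    return history[-1]


def in_collection(manifest, about_ref):
    """A reference belongs to the collection only if it is listed explicitly."""
    return about_ref in manifest


path = collection_path("alice", "my collection")
history = []
new_manifest_version(history, ["/alice/about/example.com"])
current = new_manifest_version(
    history, ["/alice/about/example.com", "/alice/about/cnn.com"]
)
```

Because the manifest lists namespaced about nodes rather than bare URI-Rs, a capture of example.com under `/bob/about/example.com` is not part of the collection unless the collection builder adds that reference.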

Let me describe the well-known entry point /about a bit more. It will not be available in a global namespace where anyone can write. Instead, every participant in IPFS will have an identity (which could be a social network handle, an IPFS block hash, a blockchain address such as a Namecoin one, or whatnot), and everyone's /about entry point will be under their respective namespace. This will make sure that people and organizations can participate in web archiving and collection building within their capacity and preferences while benefiting from shared deduplication, and at the same time the job of web archive aggregators will be safe.

oduwsdl / ipwb

Store index in IPFS for full IPFS-based web archive system #61