ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

IPFS as a backend for web archiving #28

Open ikreymer opened 8 years ago

ikreymer commented 8 years ago

I am building a new on-demand web archiving system, called webrecorder.io, which allows any web site to be archived on demand (by acting as a rewriting + recording proxy). This version (actually beta.webrecorder.io) will soon be open-sourced and will be available for users to deploy on their own.

The system allows a user to create a recording of any web site, including dynamic content, by browsing it through the recorder, eg. https://webrecorder.io/record/example.com/ and to replay it by browsing through replay, https://webrecorder.io/replay/example.com/

The recording is a WARC file, a standard used by the Internet Archive and other archiving orgs. The file can be broken down into records (basically the contents of an HTTP response + request plus extra metadata), and each of these records could be put individually into IPFS.

I suppose this sort of relates to ipfs/archives#7 but perhaps in a more sophisticated way.

Most obvious mode of operation: Store each WARC record in IPFS individually.

Some unknowns (to me):

For more reference: the system is built using these tools: https://github.com/ikreymer/pywb and https://github.com/ikreymer/warcprox. An older simplified version of the "webrecorder" concept: https://github.com/ikreymer/pywb-webrecorder.

ikreymer commented 8 years ago

This would also entail support for the memento protocol relating to ipfs/faq#35

davidar commented 8 years ago

@ikreymer thanks for initiating this, I'd really love to get this working, as it would be very helpful for ipfs/archives :)

Store each WARC record in IPFS individually.

SGTM :)

Resolving URL + TS to the hash of the stored object in IPFS

Looking at this, this would require storing a TimeMap in IPFS for each URI-R? If we assume each TimeGate has its own TimeMap, then this could easily be achieved by pushing TimeMap updates to an IPNS address (ipfs/go-ipfs#1716).

However, ideally we'd also like to be able to aggregate/federate TimeMaps across all TimeGates storing mementos for a given resource, cf ipfs/notes#40 . @ikreymer Does the memento protocol support something like this?

Would want to have users create private archives, or be able to set controls on what is accessible to whom.

IPFS doesn't yet support this, but it is planned (cc @jbenet @whyrusleeping)

jbenet commented 8 years ago

We can probably build a data structure and custom importer for WARC files so that we can traverse into the WARCs with ipfs.

ikreymer commented 8 years ago

Looking at this, this would require storing a TimeMap in IPFS for each URI-R? If we assume each TimeGate has its own TimeMap, then this could easily be achieved by pushing TimeMap updates to an IPNS address (ipfs/go-ipfs#1716).

Hmm, well, the TimeMap does not need to exist as a discrete file; it's basically a query for 'all mementos (archives) of a given url', which can be the result of a query, etc. The TimeGate is basically a query for 'the closest memento (archive) of a given url to a given date (and maybe the next, prev dates available)'.
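That TimeGate query over a sorted set of timestamps can be sketched in a few lines of Python. The `closest_memento` helper and the 14-digit timestamp format are illustrative assumptions, not part of any existing tool:

```python
import bisect

def closest_memento(timestamps, target):
    """Return the memento timestamp closest to the target datetime.

    `timestamps` is a sorted list of 14-digit WARC-style datetime
    strings (YYYYMMDDhhmmss); `target` is one such string.
    """
    i = bisect.bisect_left(timestamps, target)
    # The closest memento is either the one just before or just
    # after the insertion point.
    candidates = timestamps[max(0, i - 1):i + 1] or timestamps[-1:]
    return min(candidates, key=lambda t: abs(int(t) - int(target)))

timemap = ["20140101000000", "20150925000000", "20160101000000"]
print(closest_memento(timemap, "20150901000000"))  # -> 20150925000000
```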

I was reading https://ipfs.io/ipfs/QmdPtC3T7Kcu9iJg6hYzLBWR5XCDcYMY7HV685E3kH3EcS/2015/09/15/hosting-a-website-on-ipfs/ -- it seems that perhaps the best solution is just through the file system itself.

Is there support for nested directories? Also, how does the ipfs ls command work, is there any sorting / filtering that is possible?

One idea is to just use a url/date scheme. A url such as http://example.com/ could be added to an archive <hash1> at 201509250000000 (maybe even 17-digit granularity) under

<hash1>/http%3A%2F%2Fexample.com%2F/201509250000000 (url encoded to ensure no extra slashes, other invalid chars)
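The url-encoding step here is exactly what Python's standard library provides; a quick sketch (the `<hash1>` archive root is the hypothetical hash from above):

```python
from urllib.parse import quote, unquote

url = "http://example.com/"
# safe='' forces ':' and '/' to be percent-encoded too, so the
# encoded url is a single valid path segment with no extra slashes.
key = quote(url, safe="")
print(key)  # http%3A%2F%2Fexample.com%2F

# The encoding is reversible, so the original url can be recovered:
assert unquote(key) == url

path = "<hash1>/" + key + "/201509250000000"
```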

We can probably build a data structure and custom importer for WARC files so that we can traverse into the WARCs with ipfs.

Yes, I think this may be close to what I was thinking too. A WARC file consists of concatenated gzipped records, and so I'm proposing to store each record as above. The record contains WARC headers, including the url, date, hash digest of the payload, and a unique id (amongst other fields), followed by HTTP headers + HTTP payload.

Since there can be multiple WARC records per timestamp and url, and it's often useful to further filter down to response records (there may be other WARC records that are not needed for replay, such as request (the HTTP request) and metadata (additional metadata) records), perhaps a storage strategy would be to add:

<hash1>/http%3A%2F%2Fexample.com%2F/201509250000000/<warc record type>/<warc record id>

If this is possible, then searching to see if the archive has http://example.com/ would be just a matter of doing ipfs ls <hash1>

Searching for all records by date (eg. the TimeMap) would just be:

ipfs ls <hash1>/http%3A%2F%2Fexample.com%2F/

Serving the HTTP response from 201509250000000 would just be a matter of reading the first file from:

<hash1>/http%3A%2F%2Fexample.com%2F/201509250000000/response/

(Actually the url is usually canonicalized into a reverse-order form, eg. com,example)/ instead of example.com/, but that is a separate issue.)
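For illustration, a minimal sketch of that reverse-order ("SURT") canonicalization; real implementations also lowercase the host, strip ports and `www.` prefixes, handle query strings, etc., so the `surt` helper below is a simplified assumption:

```python
from urllib.parse import urlsplit

def surt(url):
    """Simplified SURT canonicalization:
    http://example.com/foo -> com,example)/foo
    """
    parts = urlsplit(url)
    # Reverse the dotted host components: example.com -> com,example
    host = ",".join(reversed(parts.hostname.split(".")))
    return host + ")" + (parts.path or "/")

print(surt("http://example.com/"))          # com,example)/
print(surt("http://www.example.com/foo"))   # com,example,www)/foo
```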

I'm not at all sure if this would work and/or be efficient.

However, ideally we'd also like to be able to aggregate/federate TimeMaps across all TimeGates storing mementos for a given resource, cf ipfs/notes#40 . @ikreymer Does the memento protocol support something like this?

Yes, there is also a concept of a Memento Aggregator! It is mentioned here and described at http://mementoweb.org/depot/ , and they host one which aggregates across multiple web archives. It's not formally defined, but I think aggregation is definitely part of the Memento concept.

And here is a new one, being written: https://github.com/oduwsdl/memgator It does what one might expect: query existing Memento TimeGates (over HTTP) and return an aggregated result.

Based on the above idea, this would just query multiple hashes:

ipfs ls <hash1>/http%3A%2F%2Fexample.com%2F/ ipfs ls <hash2>/http%3A%2F%2Fexample.com%2F/ ...
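A hypothetical aggregator over several such listings could be as simple as merging the per-archive results. The dict-based listing format (timestamp mapped to the archive hash holding that memento) is an assumption for illustration:

```python
def aggregate_timemaps(listings):
    """Merge per-archive memento listings into one aggregated TimeMap.

    Each listing maps a 14-digit timestamp to the archive hash that
    holds it; later listings win on timestamp collisions.
    """
    merged = {}
    for listing in listings:
        merged.update(listing)
    return sorted(merged.items())

hash1 = {"20140101000000": "<hash1>", "20150925000000": "<hash1>"}
hash2 = {"20150101000000": "<hash2>"}
print(aggregate_timemaps([hash1, hash2]))
```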

Let me know if any of this makes sense.

ikreymer commented 8 years ago

Also wanted to add here: a key distinction between 'WARC records' and plain static files is that the records are the raw HTTP request and response data (not files), including headers, encoding, etc. The HTTP headers are often important for accurately 'replaying' web content.

I can add some examples of WARC files to make it more clear.

davidar commented 8 years ago

the TimeMap does not need to exist as a discreet file

Ah, fair enough.

Is there support for nested directories?

Most definitely

Also, how does the ipfs ls command work, is there any sorting / filtering that is possible?

The API will give you back a JSON object (example, you'll need to copy and paste the link to avoid a CORS error) which you could then process.

<hash1>/http%3A%2F%2Fexample.com%2F/201509250000000 (url encoded to ensure no extra slashes, other invalid chars)

Serving the HTTP response from 201509250000000 would just be a matter of reading the first file from: <hash1>/http%3A%2F%2Fexample.com%2F/201509250000000/response/

(Actually the url is usually canonicalized into a reverse-order form, eg. com,example)/ instead of example.com/, but that is a separate issue.)

That's definitely a possibility. I'd even suggest leaving in the slashes (like wget --mirror does) to retain the directory structure, and also separating domain components into directories, so we don't end up with an enormous root directory. So perhaps something like:

/ipfs/<hash>/http/com/example/www/foo/bar/baz.html/2015/09/25/12/34/56/response/0002.warc

If we assume that most sites don't have identically named subdomains and subdirectories with differing content, then there should be minimal ambiguity (which could be resolved by double-checking the URL in the WARC file). Alternatively, we could separate the domain and path parts like .../www/|/foo/... (since | is an illegal character in URLs)
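A sketch of how a URL plus timestamp could be mapped onto this proposed layout (the `warc_dir_path` helper is hypothetical; the trailing `response/0002.warc` part is omitted):

```python
from urllib.parse import urlsplit

def warc_dir_path(url, ts14):
    """Map a URL plus 14-digit timestamp onto the proposed nested
    layout: scheme / reversed-domain components / path components /
    YYYY/MM/DD/hh/mm/ss.
    """
    parts = urlsplit(url)
    segs = [parts.scheme]
    # www.example.com -> com / example / www
    segs += reversed(parts.hostname.split("."))
    # /foo/bar/baz.html -> foo / bar / baz.html (drop empty segments)
    segs += [s for s in parts.path.split("/") if s]
    # 20150925123456 -> 2015 / 09 / 25 / 12 / 34 / 56
    segs += [ts14[0:4], ts14[4:6], ts14[6:8],
             ts14[8:10], ts14[10:12], ts14[12:14]]
    return "/".join(segs)

print(warc_dir_path("http://www.example.com/foo/bar/baz.html",
                    "20150925123456"))
# -> http/com/example/www/foo/bar/baz.html/2015/09/25/12/34/56
```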

Based on the above idea, this would just query multiple hashes

Yeah, that would work. I'm also interested in being able to merge everything into a single global tree so the client only has to query a single hash (see #8), but that's still a fair while off.

I can add some examples of WARC files to make it more clear.

I'm roughly familiar with WARC from Common Crawl

Also, the homepage of webrecorder says:

Create high-quality, verifiable archival recording of the content you browse.

How does verification work?

ikreymer commented 8 years ago

Well, taking a step back, I realized that the directory structure is immutable, so perhaps this directory structure idea isn't as useful as I had thought. I had assumed it could be used as an updatable index, but of course that's not the case.

Hmm.. I think perhaps a key-value store or simple database ipfs/ipfs#82 would be useful for querying and updating the index..

The directory structure isn't really needed, as it would just indicate a particular set of files in a WARC, or the order in which something was recorded, which is arbitrary, and not the total archive.

Can the URL and datetime be embedded as file system metadata stored with the file data? Is that possible?

What is needed is some sort of updatable index... Still trying to understand how IPFS works, sorry :)

How does verification work?

Oh I should probably update that.. It just signs the WARC using https://github.com/ikreymer/warcsigner so that it can be verified that the WARC was created with Webrecorder and not 'tampered with'. Pretty basic.

davidar commented 8 years ago

I realized that the directory structure is immutable, so perhaps this directory structure idea isn't as useful as I had thought.

IPNS provides mutable files and directories, in the same way git does (commits are immutable, but HEAD is not).

It just signs the WARC using https://github.com/ikreymer/warcsigner so that it can be verified that the WARC was created with Webrecorder and not 'tampered with'.

Ah, fair enough (also note that IPNS provides this natively). I tried looking into whether TLS could be (ab)used to provide server-signed content, but apparently not (the closest thing I found was https://tlsnotary.org/ which isn't really helpful).

ikreymer commented 8 years ago

IPNS provides mutable files and directories, in the same way git does (commits are immutable, but HEAD is not)

Hm. I see, how are simultaneous updates handled? Is there an equivalent of a merge operation?

jbenet commented 8 years ago

merge operation

not built yet, but yes it's doable. highly app dependent, so we're still playing with designs

rht commented 8 years ago

not built yet, but yes it's doable. highly app dependent, so we're still playing with designs

You mean different difftool depending on the data format type?

ikreymer commented 8 years ago

Thanks for the quick responses everyone. I guess the next step is for me to try to build a quick prototype of writing WARC records into IPFS and playing back archived content from IPFS, in the simplest way possible. Hopefully I will get a chance to try that soon. My tools are all python based, so I think I should be able to use https://github.com/ipfs/python-ipfs-api. I'll send an update when I have something to show.

davidar commented 8 years ago

@ikreymer SGTM, looking forward to it :)

davidar commented 8 years ago

My tools are all python based, so I think I should be able to use https://github.com/ipfs/python-ipfs-api

Cc: @amstocker

amstocker commented 8 years ago

@ikreymer I would be glad to help you out, so please let me know if you have questions. The python API client is pretty much stable but if you have any issues also definitely let me know.

davidar commented 8 years ago

We now have a place to coordinate porting/building apps on top of IPFS, so I've opened an issue there to discuss the details of integrating WebRecorder.io specifically with IPFS (ipfs/apps#3).

I'll leave this issue open to discuss archiving/recording web pages onto IPFS more generally, including discussions about how to store such data on IPFS such as directory naming conventions, etc.

rht commented 8 years ago

merge operation

relevant href https://news.ycombinator.com/item?id=3946856

ikreymer commented 8 years ago

Thanks everyone for your help. Sorry for the delay, I have been busy releasing a new version of webrecorder.io, now out. Also, happy to announce that webrecorder is now fully open source, at https://github.com/webrecorder/webrecorder

There's still a lot to be done before integration would be possible, but I am planning to work on a separate prototype for writing and replaying WARCs for now.

Also, look forward to stopping by the Tuesday meetup in SF and meeting folks in person..

davidar commented 8 years ago

@ikreymer Great news, looking forward to when integration becomes possible :)

PS: I just noticed you seem to be involved with hypothes.is? I've just started a discussion about IPFS integration you might be interested in :)

ikreymer commented 8 years ago

@davidar @amstocker Here is a very very rough prototype, that allows users to browse and "record" into IPFS as they browse, and replay back from IPFS. Each (gzipped) WARC record is stored individually under a url-encoded name of the URL.

https://github.com/ikreymer/pywb-ipfs/

After running the app, visit http://localhost:9080/record/<url> to record that url into IPFS, and http://localhost:9080/replay/<url> to play it back from an IPFS-based WARC record.

Redis is used to update a sorted index of URLs in real-time, and a copy is then pushed into IPFS every few seconds.

I'm not sure if this is the right approach, but just a start.
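For readers wanting to see the shape of such an index, here is a minimal stand-in for the Redis sorted set using only the standard library. The CDX-style `"<urlkey> <timestamp> <hash>"` line format is illustrative, not pywb-ipfs's actual format:

```python
import bisect

# Index lines kept sorted, so prefix/range queries stay cheap.
index = []

def add_record(urlkey, timestamp, ipfs_hash):
    # Insert the new line while keeping the index sorted.
    bisect.insort(index, urlkey + " " + timestamp + " " + ipfs_hash)

def lookup(urlkey):
    """Return all index lines for a url (the TimeMap query)."""
    prefix = urlkey + " "
    i = bisect.bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out

add_record("com,example)/", "20150925000000", "QmAaaa")
add_record("com,example)/", "20140101000000", "QmBbbb")
print(lookup("com,example)/"))
```

In the real prototype Redis plays this role because multiple recorder processes need to share one mutable index; the snapshot pushed into IPFS is then an immutable copy of this sorted state.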

jbenet commented 8 years ago

This is a really cool recorder. would be awesome to get it into a chrome extension. (can py be put into chrome? might have to be js)

davidar commented 8 years ago

can py be put into chrome? might have to be js

Hopefully this should become easier when wasm happens (or maybe even with native-client now). There's also various py->js compilers, but I'm not sure how well any of them work

ikreymer commented 8 years ago

This is a really cool recorder. would be awesome to get it into a chrome extension. (can py be put into chrome? might have to be js)

Well, I am actually trying to avoid browser plugins, as I think that limits things to a specific browser, requires user installation, and is harder to maintain. I think this makes sense as a server-side service, which in my experience is more robust, and can support any modern browser, including mobile.

I do have some questions about how to structure the data.. right now, each HTTP response is its own warc record, and there is only an index that points to each one. It could be interesting to use the dag nature of IPFS to create links between urls, but not sure what the best approach would be.. Probably periodically "committing" the current recording?

Since the recording is open-ended, there is no definite end. For example, the recorder could "commit" a linked page structure (based on referrer) on page load; then if the user interacts with the page, or scrolls down, additional content could be recorded, so the recorder could "commit" again; and if the user navigates to another page and that loads another page, that could also be committed.

Then there is the issue of merging multiple recordings. Currently, I'm just updating the IPNS name with a cumulative index, but of course the ideal is to merge multiple indices.

And it was great to present at the SF meetup. Perhaps a discussion on IRC is better; if so, let me know, and I can jump in.

jbenet commented 8 years ago

@ikreymer indeed, thanks for coming.

I do have some questions about how to structure the data.. right now, each HTTP response is its own warc record, and there is only an index that points to each one. It could be interesting to use the dag nature of IPFS to create links between urls, but not sure what the best approach would be.. Probably periodically "committing" the current recording?

I suspect we can do some clever importing (transform) of a WARC file into IPFS dag nodes. (like what ipfs tar does). That may be the easiest way to support the archives themselves.

Since the recording is open-ended, there is no definite end.. for example, could "commit" a linked page structure (based on referrer) on page load, then if user interacts with page, or scrolls down, additional content could be recorded, so the recorded could "commit" again, and then if user navigates to another page and that loads another page, that could also be committed.

Then there is the issue of merging multiple recordings.. currently, I'm just updating the IPNS name with a cumulative index, but of course ideal is to merge multiple indices..

Can use a commit chain, like with git. we'll have those soon. https://github.com/ipfs/notes/issues/23

ikreymer commented 8 years ago

I suspect we can do some clever importing (transform) of a WARC file into IPFS dag nodes. (like what ipfs tar does). That may be the easiest way to support the archives themselves.

Is there more information on how this should work? Perhaps we should discuss at some point?

I'm thinking it would be good to figure out how to properly deal with WARC files in a general sense, as this may also affect ArchiveTeam #36 and #39, since these will involve a lot of WARC files (though of course not only WARC files).

davidar commented 8 years ago

Perhaps we should discuss at some point?

Yes.

I was also talking to @nbp the other day about how we could handle Nix archive (NAR) files.

@jbenet where's the best place for these discussions to happen?

jbenet commented 8 years ago

Feel free to open other notes anywhere -- I'd love for discussions to happen in our archives or notes repo, but whatever works!


ikreymer commented 8 years ago

@davidar @jbenet i was thinking perhaps to set aside a time to chat on irc?

davidar commented 8 years ago

@ikreymer Sure, I'm free now, not sure about @jbenet

ikreymer commented 8 years ago

@davidar Can't do now unfortunately, but lets pick a time that works for everyone..

davidar commented 8 years ago

0700-1200 UTC usually works for me

jbenet commented 8 years ago

sorry my avail will be sparse before thu this week. meet without me i'd say, and i can look over a proposed WARC design?

harlantwood commented 8 years ago

@travisfw this thread may interest you...

ikreymer commented 8 years ago

To restart the discussion, I thought it probably makes sense to delve into the structure of a WARC record.

A WARC record basically consists of WARC (mime-style) headers, followed by the HTTP response (HTTP headers + HTTP payload). The format is designed so that records can easily be appended after a previous entry (for example, by a crawler):

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:6d058047-ede2-4a13-be79-90c17c631dd4>
WARC-Date: 2014-01-03T03:03:21Z
Content-Length: 1610
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
WARC-Target-URI: http://example.com?example=1
[... other WARC headers here]

HTTP/1.1 200 OK
Content-Type: ...
Content-Length...
[... other HTTP headers here]

<!doctype html>
...

Thus far, I have been storing this entire block in IPFS, but this may not be the optimal way. By design, each WARC record will be different, as it contains a unique id and a unique timestamp.

The WARC-Payload-Digest field is designed to store a hash of the payload, which of course IPFS can compute automatically..

Perhaps then, to store a WARC record, it makes sense to serialize the HTTP payload separately from the WARC headers + HTTP headers?
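A minimal sketch of that split, assuming a simplified record where each block ends with a blank line. (Production code should honor the Content-Length header from the WARC and HTTP headers instead, since payloads can themselves contain blank lines; `maxsplit=1` only protects the final payload here.)

```python
def split_warc_record(raw):
    """Split one simplified WARC 'response' record into its three
    parts: WARC headers, HTTP headers, and the HTTP payload.
    Each header block is terminated by a blank line (CRLF CRLF).
    """
    warc_headers, rest = raw.split(b"\r\n\r\n", 1)
    http_headers, payload = rest.split(b"\r\n\r\n", 1)
    return warc_headers, http_headers, payload

record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"\r\n"
          b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"\r\n"
          b"<!doctype html>...")

warc_h, http_h, payload = split_warc_record(record)
print(payload)  # b'<!doctype html>...'
```

The payload part is what would be added to IPFS as its own object; the two header blocks would be stored alongside a link to it.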

(Just for reference, here are some slides I have about the structure of the WARC format: http://ikreymer.github.io/talk-20151215/#/warcformat0)

When storing duplicate content, only a new WARC headers + HTTP headers entry needs to be added, and the HTTP payload can be matched by an existing hash. The WARC spec already supports this exact use case for deduplication.
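The digest itself is easy to reproduce with the standard library; WARC tools conventionally record it as Base32-encoded SHA-1:

```python
import base64
import hashlib

def warc_payload_digest(payload):
    """Compute a WARC-Payload-Digest value: SHA-1 of the raw payload
    bytes, Base32-encoded, as conventionally used for WARC dedup.
    """
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")

# The well-known digest of the empty payload:
print(warc_payload_digest(b""))
# -> sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

Note that IPFS hashes are multihashes over its own object encoding, so they will not equal this WARC digest for the same bytes; a mapping between the two would be needed for dedup across systems.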

The larger goal here is to be able to accurately ingest existing web archives (WARC records) into IPFS, and create new archives compatible with existing web archiving software. This ensures, for instance, that HTTP headers are preserved as well, which are often needed for accurate replay in some cases (cookies, custom headers, etc..)

davidar commented 8 years ago

Perhaps then, to store a WARC record, it makes sense to then serialize the HTTP payload separately from WARC headers + HTTP headers?

:+1:

IPLD might also allow us to store the headers as proper key-value mappings (metadata).

Cc: @mildred

mildred commented 8 years ago

IPLD would allow you to do the same thing that you can do with the current format, except that IPLD already provides you with a structured data model compatible with JSON. What you probably want to do is store the headers and then a link to the payload.

ikreymer commented 8 years ago

Thanks @mildred @davidar I was not familiar with IPLD.. Are there some examples that I can look at? Are there other tools that use it directly?

JesseWeinstein commented 8 years ago

I'm having trouble finding the IPLD spec. I found this: https://github.com/candeira/specs/blob/52f2a673df33b06e4408100fc468eea78d0f2cae/merkledag/ipld.md

and I found two implementations: https://github.com/ipfs/go-ipld and https://github.com/diasdavid/js-ipld

(edit: Ah, maybe this is the definitive PR? https://github.com/ipfs/specs/pull/37 )

mildred commented 8 years ago

IPLD is not yet ready (that's why it's still a pull request) and I don't think the implementations are ready yet (at least go-ipld isn't). But basically, it replaces the current protocol buffer implementation in go-ipfs/merkledag with a JSON-compatible data structure.

This data structure is free for application implementors (you for example that want to store some specific data structure) to use. If you think of your data structure in JSON, you can be sure to be able to store it in IPLD.

IPLD adds a link mechanism to allow linking IPLD documents together. A link in IPLD is represented by a JSON object like this one:

{
  "link": "<base58 hash of the linked object>"
}

This object can contain other properties you might want to store for the link.

ikreymer commented 8 years ago

@jbenet and I chatted. The general plan is to create an IPLD spec for WARC (http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf) once IPLD is ready, and then to create an importer to ingest WARCs, perhaps as an ipfs warc command.

Various ways to do it: either split the HTTP payload, HTTP headers, and WARC headers into separate objects linked together, or add all headers as part of the IPLD structure. Probably separate objects make sense, so that WARC digest entries can just be IPFS hashes.

I will look at existing spec and offer more specific thoughts.
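As a rough illustration of the "separate objects" option, a hypothetical IPLD-style node for one response record might look like the following. The field names are invented and the link form follows mildred's description above; nothing here is a finalized spec:

```python
import json

# Headers live in the node as key-value metadata; the payload is a
# separate object referenced by a merkle-link, so identical payloads
# dedup to the same hash across records.
record_node = {
    "warc-type": "response",
    "warc-target-uri": "http://example.com/",
    "warc-date": "2014-01-03T03:03:21Z",
    "http-headers": {
        "Content-Type": "text/html",
    },
    "payload": {
        "link": "<base58 hash of the payload object>",
    },
}

print(json.dumps(record_node, indent=2))
```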

davidar commented 8 years ago

@ikreymer SGTM :)

machawk1 commented 8 years ago

For reference @ibnesayeed and I hacked together InterPlanetary Wayback for the Archives Unleashed Hackathon in early March to get our hands dirty and experiment with WARC+IPFS interfacing.

The approach we initially took was similar to the first way @ikreymer described: we chopped up WARC files into WARC headers, HTTP headers, and HTTP payload; extracted relevant values from the WARC headers, discarding the rest; then added the temp files created from the extracted parts via a local IPFS daemon instance. It sorta worked but we hope to develop it further in a less hacky way.