oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Allow indexing and replay from live Web contents #211

Open machawk1 opened 7 years ago

machawk1 commented 7 years ago

https://github.com/VictorBjelkholm/ipfscrape is a command-line tool that takes a URI from the user, fetches the content from the live Web with wget, pushes it to IPFS, then serves it from localhost.

https://github.com/webrecorder/warcio is a Python module that provides the ability to create a WARC file from a live Web page.

ipwb currently supports the pipeline warc->ipfs->ipwb replay.

Integrate with warcio so that the pipeline becomes (write warc via warcio)->warc->ipfs->ipwb replay.

machawk1 commented 7 years ago

Q: For testing, can we instruct the CI system to kill the network connection midway through the test? This would allow us to test the feature in this issue after the push to IPFS, at pull/replay time.

machawk1 commented 7 years ago

Related: https://github.com/jbenet/http2ipfs

machawk1 commented 7 years ago

Also related: https://github.com/ikreymer/pywb-ipfs/ via https://github.com/ipfs/archives/issues/28

machawk1 commented 7 years ago

@b5 from @datatogether stated that he hopes to have a proof of concept this week that demonstrates the following procedure:

  1. Start with a user-generated collection of URLs. Allow users to fire off a "task" that will...
  2. Generate a WARC of that collection using https://github.com/datatogether/warc.
  3. Generate an IPWB-compatible CDXJ file.
  4. Put all of that on IPFS.
  5. Demo the WARC in IPWB.
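
To make step 3 more concrete, here is a rough sketch of building one CDXJ line: a SURT-style key, a 14-digit timestamp, and a JSON block. The JSON field names and the `urn:ipfs/<header hash>/<payload hash>` locator shape are my assumption about what ipwb expects and should be checked against the ipwb indexer. `to_surt` and `cdxj_line` are hypothetical helpers, and the SURT conversion is deliberately simplified (no canonicalization beyond lowercasing the host).

```python
import json
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT key: reversed host labels, then ')', path and query."""
    parts = urlsplit(uri)
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{host}){path}"


def cdxj_line(uri, timestamp, header_hash, payload_hash,
              status_code="200", mime_type="text/html"):
    """Build one CDXJ line; the JSON field names are assumed, not confirmed."""
    fields = {
        "locator": f"urn:ipfs/{header_hash}/{payload_hash}",
        "status_code": status_code,
        "mime_type": mime_type,
        "original_uri": uri,
    }
    return f"{to_surt(uri)} {timestamp} {json.dumps(fields)}"


if __name__ == "__main__":
    # Placeholder hashes, not real IPFS content identifiers.
    print(cdxj_line("http://example.com/", "20170601000000",
                    "QmHeaderPlaceholder", "QmPayloadPlaceholder"))
```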

Stand by and keep an eye on these efforts.