ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Website mirroring #7

Open davidar opened 8 years ago

davidar commented 8 years ago

TL;DR: People should be able to simply run:

    ipfs-mirror http://example.com/

without having to worry about copyright violations, etc.

There are several open-access collections that could be archived by simply spidering their websites, in the same way that Google Cache or the Internet Archive's Wayback Machine do. Of course, this should only be done for the portions of a website not disallowed by robots.txt.
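A hypothetical sketch of the crawling step (ipfs-mirror doesn't exist yet, so wget stands in here; its recursive mode already skips paths disallowed by robots.txt):

    # Polite recursive crawl: wget honors robots.txt by default,
    # --wait rate-limits requests, and an identifying user-agent
    # lets webmasters recognize the archiver in their logs.
    wget --mirror --page-requisites --wait=1 \
         --user-agent="ipfs-mirror/0.1" \
         http://example.com/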

IANAL, but from what I can tell, this is all kosher so long as there's an appropriate procedure for opting out. According to this article (which links to this document), Google is safe because it allows webmasters to opt out via robots.txt, and also has a process for responding to DMCA takedown requests.
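For illustration, the Wayback Machine's self-service opt-out has historically been a robots.txt rule naming its crawler's user-agent (ia_archiver); an ipfs-mirror tool could honor an analogous token:

    # robots.txt at the site root: opts the whole site out of
    # the Internet Archive's crawler.
    User-agent: ia_archiver
    Disallow: /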

This is the policy that the Internet Archive follows:

Request by a webmaster of a private (non-governmental) web site, typically for reasons of privacy, defamation, or embarrassment.

  1. Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard.
  2. Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site.
  3. This allows archivists to ensure that material will no longer be gathered or made available.
  4. These requests will not be made public; however, archivists should retain copies of all removal requests.

Third party removal requests based on the Digital Millennium Copyright Act of 1998 (DMCA).

  1. Archivists should attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the claim appears valid, archivists should comply.
  3. Archivists will strive to make DMCA requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on non-DMCA intellectual property claims (including trademark, trade secret).

  1. Archivists will attempt to verify the validity of the claim by checking whether the original pages have been taken down, and if appropriate, requesting the ruling(s) regarding the original site.
  2. If the original pages have been removed and the archivist has determined that removal from public servers is appropriate, then the archivists will remove the pages from their public servers.
  3. Archivists will strive to make these requests public via Chilling Effects, and notify searchers when requested pages have been removed.
  4. Archivists will notify the webmaster of the affected site, generally via email.

Third party removal requests based on objection to controversial content (e.g. political, religious, and other beliefs). [...] archivists should not generally act on these requests.

Third party removal requests based on objection to disclosure of personal data provided in confidence. [...] These requests are generally treated as requests by authors or publishers of original data.

Requests by governments. Archivists will exercise best-efforts compliance with applicable court orders.

Other requests and grievances, including underlying rights issues, error correction and version control, and re-insertions of web sites based on change of ownership. These are handled on a case by case basis by the archive and its advisors.

Anyway, it would be really helpful if IPFS had an official procedure for handling such requests (presumably gateway-dmca-denylist would be part of it).

hsanjuan commented 4 years ago

I think someone might have written such a tool (crawling a website and adding it to IPFS), but I am not sure anymore. It would be great if someone could comment. It might also just be a concatenation of wget + ipfs add.
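For example, something along these lines (a sketch assuming wget and a running IPFS daemon; the hostname is just a placeholder):

    # The "wget + ipfs add" concatenation: mirror the site (wget
    # honors robots.txt by default), then add the whole tree to IPFS.
    wget --mirror --convert-links http://example.com/
    ipfs add -r example.com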

sysfu commented 4 years ago

There are already several more or less mature tools that let users download entire websites and store them locally on a hard drive. Perhaps the initial focus should be on figuring out how to insert these local website mirrors into the IPFS network.

Once that part of the problem is solved and stable, additional development effort can go towards the component that downloads existing website structures and uploads them into IPFS directly, without the need for a local cache or copy.
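As a sketch of that first step, assuming an existing local mirror in ./example.com and a running IPFS daemon:

    # Add the local mirror to IPFS; -Q prints only the root hash.
    HASH=$(ipfs add -r -Q ./example.com)

    # Optionally publish it under this node's IPNS name, so a later
    # re-crawl can update the same mutable pointer.
    ipfs name publish /ipfs/"$HASH"

Publishing under IPNS keeps one stable name for a mirror that changes over time, which fits the re-crawl workflow described above.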