Block internet search engines from indexing the mirror

bardiharborow commented 6 years ago

If possible, are you able to make your mirror non-indexed by internet search engines? There is very minimal benefit for clearnet users to run across three (WMF, WikiVisually and ipfs) different copies of the Wikipedia article every time they search for something.

Stebalien commented 6 years ago

> There is very minimal benefit for clearnet users to run across three (WMF, WikiVisually and ipfs) different copies of the Wikipedia article every time they search for something.

The benefit is entirely for clearnet users. Tor users, for example, will (almost) always be able to access Wikipedia over Tor, so they'll see little benefit.

We should probably add a rel=canonical link pointing to Wikipedia to the head of each page but I haven't thought through the possible ramifications/downsides of this approach.

bardiharborow commented 6 years ago

> The benefit is entirely for clearnet users. Tor users, for example, will (almost) always be able to access Wikipedia over Tor, so they'll see little benefit.

If anonymity is the concern, then accessing IPFS through the ipfs.io endpoint is no more anonymous to large scale surveillance than accessing Wikipedia directly, and if anything I trust the Wikimedia Foundation to handle server logs better than ipfs.io. Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

> We should probably add a rel=canonical link pointing to Wikipedia to the head of each page but I haven't thought through the possible ramifications/downsides of this approach.

Doing so would have the intended effect of removing the mirror from Google search results, and it is actually the preferred way to implement this.
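For reference, here is a rough sketch of what such a tag could look like for a given article. The base URL (English Wikipedia) and the helper name `canonical_link_tag` are assumptions for illustration only; the thread only says the link should point to Wikipedia.

```python
from html import escape
from urllib.parse import quote

# Assumed upstream base URL; the mirror may cover other language editions.
WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def canonical_link_tag(title: str) -> str:
    """Build a <link rel="canonical"> element for an article title (hypothetical helper)."""
    # MediaWiki article URLs use underscores in place of spaces.
    path = quote(title.replace(" ", "_"))
    url = WIKIPEDIA_BASE + path
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


print(canonical_link_tag("Alan Turing"))
# prints: <link rel="canonical" href="https://en.wikipedia.org/wiki/Alan_Turing">
```

The tag would go inside the head of each mirrored page, so that search engines treat the Wikipedia URL as the preferred copy.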

Stebalien commented 6 years ago

@bardiharborow

> If anonymity is the concern, then accessing IPFS through the ipfs.io endpoint is no more anonymous to large scale surveillance than accessing Wikipedia directly, and if anything I trust the Wikimedia Foundation to handle server logs better than ipfs.io. Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

Ah, I think the confusion may be around the definition of "clearnet". IPFS is a clearnet. That is, it's not a darknet (it provides no anonymity at the moment). Darknets get no benefit because the exit nodes tend to be in countries with strong free speech laws.

> Users of the actual IPFS software will presumably discover the mirror through different means than Google, and will not be impacted by this change.

Unlikely. We don't have any IPFS search mechanisms and rely entirely on web search engines. That's probably one of the reasons we don't use rel=canonical links.

rameshvarun commented 6 years ago

+1 for setting rel='canonical' links. I'm starting to see the mirror pop up frequently on the first page of Google results just from normal everyday use. Canonical links should avoid this duplication and make the mirror a good web citizen.

nemobis commented 6 years ago

I agree with adding the rel="canonical": it's annoying to see search duplicates. By not indexing outdated content, you'll also alleviate the concerns raised in other issues such as https://github.com/ipfs/distributed-wikipedia-mirror/issues/55 and https://github.com/ipfs/distributed-wikipedia-mirror/issues/49.

Actually, what's the purpose of indexing all the pages at all? A noindex meta tag may be appropriate.

wesleylima commented 5 years ago

The canonical tag is missing because the HTML files generated by Kiwix's mwoffliner do not include it. I opened an issue: https://github.com/openzim/mwoffliner/issues/564

nemobis commented 5 years ago

> The lack of canonical tag comes from the htmls generated by kiwix's mwoffliner.

I understand, but you can also send the canonical link in the web server's HTTP response headers (a Link header with rel="canonical"), without touching the HTML.
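A minimal sketch of the header-based approach, assuming a WSGI-style server and a hypothetical path layout of `/wiki/<Title>`. The real gateway is not a Python application, so this only illustrates the shape of the header, not how ipfs.io would set it.

```python
from urllib.parse import quote

# Assumed upstream base URL, for illustration only.
WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def canonical_link_header(title):
    """Return a (name, value) pair for an HTTP Link header with rel="canonical"."""
    url = WIKIPEDIA_BASE + quote(title.replace(" ", "_"))
    # Link header syntax: the target URL in angle brackets, then parameters.
    return ("Link", f'<{url}>; rel="canonical"')


def app(environ, start_response):
    """Tiny WSGI app serving a placeholder page with the canonical header attached."""
    # Hypothetical layout: the last path segment is the article title.
    title = environ.get("PATH_INFO", "/").rsplit("/", 1)[-1]
    body = b"<html><body>placeholder article body</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        canonical_link_header(title),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should serve the placeholder page with the extra header on every response.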

lidel commented 5 years ago

I fixed this upstream (https://github.com/openzim/mwoffliner/pull/963) :ok_hand: Old snapshots are about to be excluded via /robots.txt (https://github.com/ipfs/website/pull/334)
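To illustrate the /robots.txt part: the sketch below checks a made-up rule set with Python's standard `urllib.robotparser`. The rules and the `QmOldSnapshotPlaceholder` / `QmNewSnapshotPlaceholder` paths are placeholders and are not the actual contents of the linked pull request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real rules live in ipfs/website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ipfs/QmOldSnapshotPlaceholder/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A page under the excluded (old) snapshot path must not be crawled.
old_page = "https://ipfs.io/ipfs/QmOldSnapshotPlaceholder/wiki/Alan_Turing.html"
# A page under any other path is still allowed.
new_page = "https://ipfs.io/ipfs/QmNewSnapshotPlaceholder/wiki/Alan_Turing.html"

print(rules.can_fetch("*", old_page))  # prints: False
print(rules.can_fetch("*", new_page))  # prints: True
```

Note that robots.txt only stops crawling of the listed paths; the canonical link is still what tells search engines which copy to prefer for pages that remain crawlable.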

Remaining steps before this issue can be closed:

[x] mwoffliner 1.9 to be released with the fix
[x] wiki snapshots at http://wiki.kiwix.org/wiki/Content_in_all_languages are made with updated mwoffliner and include <link rel="canonical"
[ ] snapshots are put on ipfs + pinned on a reliable cluster
[ ] snapshot-hashes.yml are updated to versions with canonical links

OR:

[x] While https://github.com/openzim/mwoffliner/pull/963 solves the problem for new snapshots, it is still possible the script will be run against an old ZIM without the header. Before adding to IPFS, the script should check if the root document contains the header, and if not, manually add it to every document. Filed https://github.com/ipfs/distributed-wikipedia-mirror/issues/65 to track this
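The check-then-add step described above could look roughly like this. It is a sketch only: the function names are hypothetical, it is not the code from issue #65, and it assumes each document has a closing `</head>` tag to insert before.

```python
import re
from html import escape
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records whether a <link rel="canonical"> element appears in the document."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            # rel may hold several space-separated tokens, in any letter case.
            rel = (dict(attrs).get("rel") or "").lower().split()
            if "canonical" in rel:
                self.found = True


def has_canonical(html_text):
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.found


_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def ensure_canonical(html_text, canonical_url):
    """Return html_text unchanged if it has a canonical link, else insert one."""
    if has_canonical(html_text):
        return html_text
    tag = f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
    # Insert just before the first </head>; a function replacement avoids
    # backslash handling in the replacement string.
    patched, count = _HEAD_CLOSE.subn(lambda m: tag + m.group(0), html_text, count=1)
    if count == 0:
        raise ValueError("document has no </head> to insert before")
    return patched
```

Running `ensure_canonical` a second time on its own output returns the text unchanged, so the step is safe to apply to both old and new snapshots.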

I will keep checking on the mwoffliner/kiwix situation, but if someone has spare bandwidth and can help speed things up, please contribute upstream & post updates here.

lidel commented 3 years ago

This has been fixed by https://github.com/ipfs/distributed-wikipedia-mirror/issues/65 and will be solved upstream when new snapshots are published as part of #60 #61.

ipfs / distributed-wikipedia-mirror

Block internet search engines from indexing the mirror #48