Open kevinjwalters opened 5 years ago
Got an update from GH today:
I'm not quite sure of the reason behind excluding wiki from Google's index but I'll pass your request onto the team to consider.
I can't promise if or when it will be implemented, but your feedback is definitely in the right hands!
I urge the GitHub team to remove this restriction on robots scanning the GitHub wiki pages.
I put a great deal of effort into providing wiki pages that would assist users of my open source software. I also hoped that they would help potential users find the project by providing meaningful content related to the problems my software addresses. The fact that Google cannot index my pages seriously limits the effectiveness of that content.
For example, I've written an article on Natural Neighbor Interpolation, which is a function my software supports. It's a specialty topic and the information I supply is not well-covered elsewhere. Enough people have linked to my article that if you run a Google search on "Natural Neighbor Interpolation" my wiki page comes up as the fourth item in the results. But, disappointingly, the description line on Google's search page reads "No information is available for this page".
Therefore I respectfully request that Github reconsider its position on restricting web crawlers from indexing wiki pages.
Same problem here. Any updates on GitHub removing this entry? It does more harm than good.
Still blocked from crawlers.
Please can GitHub remove the restriction on Google and other search engines crawling Wiki pages? I want my Wiki to be seen!
GitHub should remove this entry from the robots.txt file and let the repo owner decide. The default setting for a wiki page could be "noindex, nofollow" set in a meta tag, but it should be possible to unset it.
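As a sketch of that idea (entirely hypothetical; GitHub exposes no such per-repository setting today), the toggle could boil down to something like:

# Hypothetical sketch only: wiki_indexable is a made-up per-repository setting.
def robots_meta_tag(wiki_indexable: bool = False) -> str:
    """Return the robots meta tag for a wiki page.

    Indexing is blocked by default; the repository owner could opt in
    by flipping the (hypothetical) wiki_indexable setting.
    """
    content = "index, follow" if wiki_indexable else "noindex, nofollow"
    return f'<meta name="robots" content="{content}">'

print(robots_meta_tag())                      # noindex, nofollow (default)
print(robots_meta_tag(wiki_indexable=True))   # index, follow (owner opted in)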
They appear to have shifted to a custom crawling process. First two lines of current robots.txt:
# If you would like to crawl GitHub contact us via https://support.github.com/contact/
# We also provide an extensive API: https://developer.github.com/
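For anyone who wants to check the block for themselves, here is a small sketch using Python's standard urllib.robotparser against the live robots.txt (the wiki URL and user agents are just examples):

# Check whether GitHub's robots.txt allows a crawler to fetch a wiki page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://github.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

wiki_url = "https://github.com/commaai/openpilot/wiki/FAQ"  # example page
for agent in ("Googlebot", "bingbot", "DuckDuckBot"):
    print(agent, rp.can_fetch(agent, wiki_url))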
Google and DuckDuckGo still aren't indexing the GitHub wiki pages. I found another obscure search engine called Bing, which also gives no results for wiki pages. I've not had any further updates from GitHub. I'll prod them again to see why they've ignored this request and persist in allowing only a partial crawl of GitHub.
For reference: github.com.robots.20210120.txt
I just put in a new support ticket for GitHub to review this fiasco, and I mentioned this issue for detail and to support the fix.
GitHub support says:
According to our SEO and engineering teams, we originally blocked /wiki in January 2012 to address spam and any risks from wikis being open to anyone adding content. (When wikis were first introduced the default settings meant that anyone could edit them, whether they were a collaborator on the repository or not.)
Some pages had slipped through since it wasn’t written with a proper wildcard (*). That was fixed in May 2020 blocking all /wiki/ directories.
I’m afraid this is a deliberate decision, and it is not likely to be reversed due to the risk of wikis being used for spammy purposes.
So sorry about that; I completely understand why this could be a blocker.
Although it's unlikely to be unblocked, I am forwarding your ticket to the Product team to record your request for this change. They read and evaluate all feedback, however we cannot guarantee a response to every submission.
Kevin responds:
It's still not clear to me why you wouldn't allow the wiki areas which are not publicly editable to be available via Google Search and the like. I've not looked into this, but I'd imagine it would be trivial to do by allowing a full crawl of /wiki and then putting the appropriate indexing hints into HTTP response headers or HTML to restrict indexing for wiki areas based on the repository configuration.
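A rough sketch of what that suggestion could look like, purely hypothetical and not anything GitHub has said it does: keep the crawl open in robots.txt and instead emit an X-Robots-Tag response header only for wikis that are open to public editing.

# Hypothetical sketch: serve indexing hints per wiki based on repository settings.
def wiki_response_headers(publicly_editable: bool) -> dict:
    """Headers a server could attach to a wiki page response.

    Wikis that anyone can edit get noindex, addressing the spam concern;
    wikis restricted to collaborators stay indexable.
    """
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if publicly_editable:
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(wiki_response_headers(publicly_editable=True))
print(wiki_response_headers(publicly_editable=False))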
A wiki not being crawlable is nonsense.
The comma.ai community put a lot of work into the FAQ and many other pages. It's a bummer that it isn't indexed. I'm sure a few other projects have similar wikis with lots of content in them that are pretty much invisible.
Maybe there should be a warning put on the Wiki functionality that the content in Wikis is generally invisible to search engines.
The suggestion that a 'closed' Wiki that does not allow comments should be eligible to be crawled sounds sensible to me. This would stop people spamming GitHub, and would allow each project to decide if they wanted to make their Wiki searchable.
In any event, if someone wanted to spam GitHub, most projects allow issues to be raised. The argument that Wikis must be blocked from crawling to stop spamming is a bit thin, because Issues could just as easily be used as a vector for spam and trolling.
Please allow projects to make their Wiki crawlable.
A wiki's whole purpose is to share useful information, and that purpose is defeated if its content cannot reach as wide an audience as its creators intend. Sure, there should be a way to allow "private" wikis, but there should also be a way to have public ones. Otherwise projects will use other services to host such things (which I've seen in the past and not understood until now).
Setting non-crawlable as a default seems reasonable, but not allowing projects to choose otherwise does not. Please reconsider.
I think the URLs are visible to Google and other search engines. When I search for terms that match a URL, the matching terms are bolded in the URL and the page does come up in the search results. I am not sure whether the page content is used, though.
If you've ever searched for something that exists on StackOverflow, you may have noticed mirrors of StackOverflow content also ranking highly. I don't particularly like these operations, but maybe what they're doing can help here.
I hastily made this service to try to get the comma.ai openpilot wiki content indexed:
https://github-wiki-see.page/m/commaai/openpilot/wiki
It's quite sloppy, but it should work for other wikis too if a relevant link is placed somewhere crawlable. I'm no SEO expert, so this experiment may very well crater, but I figured I'd try something for not a lot of money. I doubt it'll rank highly since there are no links to it and it is in no way canonical.
I've also made some PRs, as you can see in the issue reference alerts, to update the GitHub documentation. In them, I've suggested adding a note that users who want content that is both crawlable and open to public contributions should set up a GitHub Pages site backed by a public repository. To be honest though, that setup is kind of a pain in the ass for all parties and we're all lazy bastards.
💸
I ran this big boy of a query in BigQuery as part of my project to generate sitemaps for my workaround:
#standardSQL
-- Extract the html_url of every wiki page touched by a GollumEvent (wiki edit)
-- recorded in the GitHub Archive public dataset.
CREATE TEMPORARY FUNCTION
  parsePayload(payload STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS """
    try {
      return JSON.parse(payload).pages.reduce((a, s) => { a.push(s.html_url); return a }, []);
    } catch (e) {
      return [];
    }
  """;
SELECT
  *
FROM (
  WITH
    parsed_payloads AS (
      SELECT
        parsePayload(payload) AS html_urls,
        created_at
      FROM
        `githubarchive.month.*`
      WHERE
        type = "GollumEvent")
  SELECT
    DISTINCT html_url,
    created_at,
    -- Keep only the most recent event per page.
    ROW_NUMBER() OVER(PARTITION BY html_url ORDER BY created_at DESC) AS rn
  FROM
    parsed_payloads
  CROSS JOIN
    UNNEST(parsed_payloads.html_urls) AS html_url)
WHERE
  rn = 1
  -- Exclude Home pages and wiki furniture.
  AND html_url NOT LIKE "%/wiki/Home"
  AND html_url NOT LIKE "%/wiki/_Sidebar"
  AND html_url NOT LIKE "%/wiki/_Footer"
  AND html_url NOT LIKE "%/wiki/_Header"
$45 poorer, I had a list of 4,566,331 wiki pages that have been touched over the last decade, excluding Home and the trimmings. That's a lot of content being excluded by robots.txt!
I've saved the results into the publicly accessible github-wiki-see.show.touched_wiki_pages_upto_202106 table if anyone else wants a gander. It's a small ~500MB dataset compared to the $45's worth of 9TB I had BQ crunch through.
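For anyone reproducing the sitemap step, here is a minimal sketch, assuming the query results have been exported to a plain text file of URLs, one per line (the file name is made up), that splits them into sitemap files respecting the protocol's 50,000-URLs-per-file limit:

# Split a list of wiki URLs into sitemap XML files (max 50,000 URLs each).
from xml.sax.saxutils import escape

URLS_PER_SITEMAP = 50_000  # limit from the sitemaps.org protocol

def write_sitemaps(url_file: str, prefix: str = "sitemap") -> None:
    with open(url_file) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    for i in range(0, len(urls), URLS_PER_SITEMAP):
        chunk = urls[i:i + URLS_PER_SITEMAP]
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in chunk)
        body = ('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                f"{entries}\n</urlset>\n")
        with open(f"{prefix}-{i // URLS_PER_SITEMAP:05d}.xml", "w") as out:
            out.write(body)

write_sitemaps("touched_wiki_pages_upto_202106.txt")  # hypothetical export file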
I've also been using the litmus test of searching for openpilot wiki nissan and openpilot wiki nissan leaf to see what search engines do about GitHub wikis. If the terms are in the URLs, a result does show up. If you search for openpilot wiki nissan leaf, though, no results show up in Google. As a side note, my GHWSEE tool does show up in DDG/Bing 😄.
I think search engines don't index the content if robots.txt excludes it, but they do index the link components.
I've since produced a new BigQuery table and a new bundle of sitemaps from it that checks all the links and only includes pages returning HTTP 200: github-wiki-see.show.checked_touched_wiki_pages_upto_202106. There are 2,090,792 pages returning 200.
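The link-checking pass is conceptually simple; a sketch of the idea (using the third-party requests library, and sequential here, though a real pass over two million URLs would want concurrency and rate limiting):

# Keep only wiki URLs that currently respond with HTTP 200.
import requests

def filter_200s(urls):
    live = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=10, allow_redirects=False)
        except requests.RequestException:
            continue  # treat network errors as dead links
        if resp.status_code == 200:
            live.append(url)
    return live

sample = ["https://github.com/commaai/openpilot/wiki/FAQ"]  # example input
print(filter_200s(sample))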
FWIW, I've made my mirroring tool append the attribute rel="nofollow ugc" to any links going outside of GitHub. Maybe they could do something like this if they decide to change their minds.
It turns out they already attach rel="nofollow" to external links, but not rel="nofollow ugc".
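As a sketch of that kind of link rewriting (using BeautifulSoup here; the actual GHWSEE implementation may well differ), external links can be tagged like this:

# Add rel="nofollow ugc" to any link that points outside github.com.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def mark_external_links(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc
        if host and not host.endswith("github.com"):
            a["rel"] = ["nofollow", "ugc"]  # serialized as rel="nofollow ugc"
    return str(soup)

page = '<p><a href="https://example.com">elsewhere</a> <a href="/wiki/Home">home</a></p>'
print(mark_external_links(page))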
GitHub currently has a robots.txt which is preventing crawling of the paths associated with the Wiki area for each and every repository. This is explicit and looks very intentional. I've asked about this (19-Oct-2019) and got no response; the ticket number is 430217. I've attached the current (27-Oct-2019) robots.txt file.
github.com.robots.20191027.txt
The gist of it:
I would like this to change to make the Wiki areas searchable using popular search engines.