twisted-infra / braid

Automation scripts for twistedmatrix.com

Disallow all but the most current API/Lore documentation from being indexed by search engines #118

Open adiroiban opened 8 years ago

adiroiban commented 8 years ago

This is a reminder to review and import/discard the changes from https://github.com/twisted-infra/t-web/pull/4

@hawkowl maybe you can resubmit a PR on braid.

Thanks!

cdunklau commented 5 years ago

@hawkowl's original PR uses a robots.txt change that stops crawlers from indexing anything under /documents except /documents/current. The PR has this text, included for context:

This change would disallow Google/Bing/et al from indexing old versions of the documentation, and just 'current'. The upside of this is that googling for "twisted api documentation" will then bring up the most recent always (and not 13.1.0 like it does currently). The downside to this is that it makes searching for older versions of the API documentation much harder (eg. "twisted 8.2 api documentation" wouldn't come up with anything).
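For illustration, here is a rough sketch of what such a rule set might look like and how it would treat a couple of paths. The rules below are an assumption based on the description above, not the contents of the actual PR, and the check uses Python's standard urllib.robotparser, which applies the first matching rule (real crawlers may resolve Allow/Disallow conflicts differently).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, reconstructed from the description only.
# Allow is listed first because urllib.robotparser uses the first match.
ROBOTS_TXT = """\
User-agent: *
Allow: /documents/current/
Disallow: /documents/
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT):
    """Return True if a generic crawler may fetch the given site path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", "https://twistedmatrix.com" + path)


if __name__ == "__main__":
    for path in (
        "/documents/current/api/index.html",
        "/documents/13.1.0/api/index.html",
    ):
        print(path, is_crawlable(path))
```

With these assumed rules, only the "current" tree stays crawlable, which is exactly the trade-off the PR text describes.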

The idea of attempting to bias search results towards current docs is IMO quite important. "Google shows old docs" is a complaint I hear often when discussing Twisted with other people, and I'm personally getting kinda tired of the fastest path to current docs being: search google/etc, click old docs link, click (admittedly very helpful) current docs link, and finally get what I wanted to see.

That being said, the suggested robots.txt change seems quite heavy-handed, and I'm confident that there's a way to accomplish current-version docs prioritization with more finesse. After I brought it up on IRC, @exarkun and @altendky suggested a couple of potential techniques: a site map that sets a <priority> for each URL, and <link rel="canonical"> tags on the docs pages.

Having skimmed a bit about both of these ideas, I suspect neither is a perfect fit, but this at least can give a starting point for further investigation. Prior art could be a useful resource... I've noticed that search results for Python's docs tend to be just the /2/ or /3/ versions, and Read the Docs links tend to come up pointing at "stable" or "latest".

If anyone has additional ideas or observations, I'd love to hear them.

Julian commented 5 years ago

I have no useful ideas other than yes this is turrible and is the way I've used the docs for as long as I can remember, and also more importantly that this has always worked well from what I can remember on RTD, so whatever they do works decently and possibly can just be copied.

altendky commented 5 years ago

@cdunklau do you have concerns with the site map functionality? Or just that we have to write (or find) a tool to create it? It seems like expressing a priority is what we would want. I would think old pages should still be findable, just not by general searches that would also find new pages.

Also, there are certain Python docs that consistently come up as 3.3 and 3.1. (sorry, can't remember which. I just tried and couldn't find them.)

cdunklau commented 5 years ago

I should stress that I have basically no experience or knowledge of SEO, crawlers, and search engine indexing behavior. I'm mostly trying to stimulate discussion about how to get it done, and hope for someone else to pave the way. I'm willing to do the implementation work, though.

That said....

https://www.sitemaps.org/protocol.html#xmlTagDefinitions has this to say about <priority>:

The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites—it only lets the search engines know which pages you deem most important for the crawlers.

So it seems to be a site-global importance metric, not relative to certain other pages as I had initially hoped. It seems like that would be too coarse a knob... but I'm not sure.

OTOH the <link rel="canonical"> idea initially seemed to fall short, at least based on the examples on the Wikipedia page showing changes in domain or query strings... but now that I look at it again, it's supposed to actually point at the "canonical" version, not mark the current page as canonical as I'd originally thought. This might be the right approach!

If the docs pages could be rewritten to have that link tag, perhaps that would be enough. It seems like https://github.com/twisted-infra/braid/blob/master/services/t-web/docs/website-template.tpl would be the place to do this, but I'm not sure how you'd get the "current page path" in the template, in order to make the link tag.
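As a starting point, here's a minimal sketch of how the tag could be derived from a page path. It assumes the docs live under /documents/<version>/... with a dotted numeric version, and that the same relative path exists under /documents/current/ (which won't be true for pages that were removed or renamed). The function name and BASE_URL constant are made up for illustration; nothing here is taken from the existing template.

```python
import re
from html import escape

# Assumed site root; adjust if the docs are served elsewhere.
BASE_URL = "https://twistedmatrix.com"

# Assumed layout: /documents/<dotted version>/<rest of path>
VERSIONED_DOC = re.compile(r"^/documents/(\d+(?:\.\d+)*)/(.*)$")


def canonical_link_tag(page_path):
    """Build a canonical link tag pointing a versioned page at 'current'.

    Returns an empty string for paths that are not versioned docs pages
    (including pages already under /documents/current/).
    """
    match = VERSIONED_DOC.match(page_path)
    if match is None:
        return ""
    canonical = f"{BASE_URL}/documents/current/{match.group(2)}"
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'


if __name__ == "__main__":
    print(canonical_link_tag("/documents/13.1.0/api/twisted.internet.defer.Deferred.html"))
```

Whatever generates or rewrites the pages would still need to know each page's own path to call something like this, which is the open question above.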

glyph commented 5 years ago

(I have nothing to add but I just wanted to say thanks @cdunklau @Julian @altendky for picking up this thorny and intractable problem again)

cdunklau commented 5 years ago

@glyph It's almost like OSS was a good idea after all :)

But while you're here... could you perhaps give a high-level overview (or link) about how the documentation makes its way from a particular Twisted release into the https://twistedmatrix.com/documents/ tree? I'm hoping for at least some links to the individual bits in the chain, but some details would be much appreciated, especially those that could help run the Vagrant setup in a way where one could see the tweaks to the documentation "live" as it were.

glyph commented 5 years ago

I haven't done this in many years but luckily it's documented here: https://twisted.readthedocs.io/en/latest/core/development/policy/release-process.html#update-documentation

@hawkowl might be able to elaborate further.

cdunklau commented 5 years ago

Thanks, that should get me started. Not too sure when I'll get around to this... if someone with issue edit rights could assign it to me, that might make it less likely that I forget about it :)

adiroiban commented 5 years ago

well... this is GitHub... so no easy way to interact with "strangers".

I sent an invite to @cdunklau for Braid repo, with write access. Once accepted I hope that you can assign this ticket.

Thanks!

cdunklau commented 5 years ago

@adiroiban thanks!

altendky commented 5 years ago

@cdunklau, the canonical link seemed inherently wrong in my head. It would be stating that all the doc pages for a particular thing are the same, as I read it, and that all search results should go to the canonical page. They certainly are not and should not. An explicit search for twisted 8.0.0 deferred ought to end up at the v8 docs. Also, the default priority is stated to be 0.5 so it's easy enough to just not even document the current region in the site map and just make all the old pages be 0.2 or some gradient based on age or... Old pages shouldn't come up for 'regular' searches but they should be findable if the search terms can overcome the priority bias towards current.
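A rough sketch of that site map idea, assuming we already have a list of old-version URLs from somewhere (the function name and the example URL are hypothetical). It only lists the old pages, each with a reduced priority, and leaves the current docs out so they keep the default of 0.5. Whether any given search engine actually honors the value is a separate question.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(old_urls, priority=0.2):
    """Return sitemap XML listing only old-version pages at a low priority."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in old_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # The protocol allows values from 0.0 to 1.0; one decimal is enough.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(["https://twistedmatrix.com/documents/8.2.0/api/index.html"]))
```

A gradient based on age would just mean computing the priority argument per URL instead of using one fixed value.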

Despite describing our site incorrectly in my judgement, I will acknowledge that the canonical link certainly is a more direct knob to use. No parameter to explore and tune.

cdunklau commented 5 years ago

It would be stating that all the doc pages for a particular thing are the same, as I read it...

Yeah, this is how I read it too. https://support.google.com/webmasters/answer/139066?hl=en seems to confirm that.

The thing I'm concerned about with the site map idea is its global nature... it looks like we'd have to make a site map for all the things we want indexed, not just docs. If there was a separate subdomain just for docs, a site map would clearly be a useful and likely easy-to-implement solution... but ISTM that generating a site map for the current site could be quite onerous.

<priority> looks to be less useful than we thought, as Google claims not to consume it... so I think only <lastmod> and <changefreq> could help. I wonder if just the temporal things would help...

altendky commented 5 years ago

Google claims not to consume it...

Well by golly gee that's no good. Aside from that making this irrelevant, I figured you could document only the things you want to provide extra info for. This could be only the not-current documentation directories and everything else would be left at default priority. But oh well.