rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Canonical urls for deduplication of google results in rustdoc #9461

Closed Seldaek closed 1 year ago

Seldaek commented 10 years ago

When multiple versions of the documentation are available, they tend to pollute Google results. To prevent that, it would be good to always have the latest stable release available under /current/, and have all previous versions + the master docs contain canonical links to the current docs like:

<link rel="canonical" href="http://.../current/..." />

That way it consolidates all results under the current URL which will always be correct, and it also encourages people linking to docs in blog posts and such to use links that will not rot.
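A minimal sketch of that rewrite, assuming pages live under `/<version>/<rest>` and that the host (here the placeholder `http://example.org`) is supplied by whoever generates the docs. The function names are hypothetical and not part of rustdoc:

```rust
/// Rewrites a versioned docs path such as `/0.8/std/vec/index.html`
/// to the same page under `/current/`.
/// Returns `None` if the path has no leading version segment.
fn canonical_path(path: &str) -> Option<String> {
    // Split "/<version>/<rest>" into the version segment and the rest.
    let trimmed = path.strip_prefix('/')?;
    let (version, rest) = trimmed.split_once('/')?;
    if version.is_empty() {
        return None;
    }
    Some(format!("/current/{rest}"))
}

/// Builds the `<link rel="canonical">` tag for a page.
/// `base` is an assumed site root such as "http://example.org".
fn canonical_link(base: &str, path: &str) -> Option<String> {
    let canonical = canonical_path(path)?;
    Some(format!(
        "<link rel=\"canonical\" href=\"{}{}\" />",
        base.trim_end_matches('/'),
        canonical
    ))
}
```

Every old version and the master docs would then emit the same tag for a given page, so search engines see a single preferred URL per item.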

/cc @alexcrichton

alexcrichton commented 10 years ago

Do you know how search engines handle situations where pages go away or pages are just created? In theory old documentation could refer to a canonical location which no longer exists (if the module were removed), and new documentation could refer to canonical locations which do not yet exist (because they're newly added modules).

Do you know of special attributes to handle these cases?

thestinger commented 10 years ago

If a module is removed, a 404 is correct. In theory it would be better to redirect them on renames but it's not going to be possible because it's not tracked.

The point of a canonical URL is to say that the page is only a non-canonical version of another URL and shouldn't show up separately in search. When we eventually have supported versions, the newest release (or master) can be given as the canonical one so the older pages won't clutter search results but will be available via a drop-down menu.

Of course, if the newer version does not have the module, you would have to omit stating it is the canonical URL - meaning you need to regenerate the old documentation every time you do the new ones. I don't think it's worth the complexity.
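To make that cost concrete, here is a rough sketch of the check, assuming the set of page paths in the newest docs is available when the old docs are regenerated. `canonical_for` and its parameters are hypothetical names:

```rust
use std::collections::HashSet;

/// Returns the canonical URL for `page` only if the newest docs still
/// contain a page at the same relative path; otherwise returns `None`,
/// meaning the old page should not declare a canonical URL at all.
fn canonical_for(
    page: &str,
    newest_pages: &HashSet<String>,
    newest_base: &str,
) -> Option<String> {
    if newest_pages.contains(page) {
        Some(format!("{}/{}", newest_base.trim_end_matches('/'), page))
    } else {
        None
    }
}
```

Because `newest_pages` changes with every release, each old version's output depends on the newest one, which is the regeneration burden described above.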

thestinger commented 10 years ago

FWIW I think we should only have documentation on the site for releases we still support. Until we get to 1.0, we can make an exception for the last 0.x snapshot :).

chris-morgan commented 10 years ago

When a module is removed, 404 is indeed correct, but just remember that that's not the end of the story, as I wrote about recently at http://chrismorgan.info/blog/github-links-case-study.html.

What the Django docs do is worth considering: https://docs.djangoproject.com/. They make it easy to switch between versions and show a warning banner for the development build suggesting you may want to look at the latest stable instead. They don't, however, have a banner reminding you "this isn't the latest stable version" for old versions, which continues to surprise me a little. I reckon old versions (though not before 1.0 after a while) should stay in existence but with a banner at the top indicating that this is an unsupported release, and that docs for the latest version, X.Y, are available in such-and-such a place. Of course, these things become much more directly applicable once we get to 1.0 and beyond.

@alexcrichton I guess in the no-longer-exists case you'd need to either implement something so that you can conveniently reprocess the old docs, or do a little bit of post-processing to fix the "errors". For the doesn't-yet-exist case, checking online or comparing crates (which sounds risky) would be the only real ways, I suppose.

steveklabnik commented 9 years ago

Triage: no change.

steveklabnik commented 8 years ago

Triage: no changes

SamWhited commented 7 years ago

(sorry for the duplicate; moving relevant link here)

See also: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Choosing_between_www_and_non-www_URLs#Using_%3Clink_relcanonical%3E

This page from Google's help center [1] appears to suggest that they only use the canonical URL as a hint. While it doesn't explicitly say so, this seems to concur with behavior I've seen in the past: if the page pointed to by the canonical URL is a 404, Google simply uses the original URL (which I suspect is what we want in this case since it makes it easy: point canonical URLs to the /stable path, and if the module is deleted it doesn't really matter).

sanmai-NL commented 7 years ago

I would like to re-open discussion on this issue (@steveklabnik, I think you would be the one to ping).

With https://docs.rs/ in place, I think all rustdoc documentation for public crates should provide a canonical link to the appropriate documentation there. Why? Search for any popular crate on Google and you get a litter of confusing, often outdated self-hosted versions of the docs. This may lead the programmer to accidentally study outdated publications of documentation (e.g. when not depending on a specific version of a library), and use and perhaps even bookmark publications that may be partially broken or are not as continuously available as on https://docs.rs/.

Google's former SEO representative has indicated that Google may disregard canonical links that result in 404 HTTP response codes. Since Google is by far the most used search engine, and it addressed this issue in a sensible way already, I personally take little issue with the possibility of 404 canonical links.

Here's my logic:

steveklabnik commented 7 years ago

> I would like to re-open discussion on this issue (@steveklabnik, I think you would be the one to ping).

No need to re-open anything 😄 It's an open issue.

> With https://docs.rs/ in place, I think all rustdoc documentation for public crates should provide a canonical link to the appropriate documentation there

That would be nice, but without some improvements to docs.rs, it's not feasible. There are several people who do extra things to make their docs nicer and explicitly don't want their docs hosted on docs.rs at all.

sanmai-NL commented 7 years ago

Interesting, could you point me to examples of what those people do?

steveklabnik commented 7 years ago

I believe @briansmith, @retep998 , and @bluss are three of those people?

retep998 commented 7 years ago

I'm perfectly fine with my docs being hosted on docs.rs. I just haven't actually published a new winapi since docs.rs gained Windows support. There's a few features I'm waiting for like the ability to specify the default target to show docs for and which cargo features to enable. But once that's all set then I'd much rather use docs.rs than have to deal with rustdoc generating a hundred thousand files and then committing them to git and pushing them (which is a really slow process).

There's a few copies of winapi documentation floating around the internet from other people's personal project documentation being published, and I really wish they didn't exist because they interfere with search results. Sometimes I'll look up some obscure Windows function and the only results will be someone's rustdoc-generated documentation that happens to include winapi.

bluss commented 7 years ago

@sanmai-NL A crate needs to be compiled to generate its docs, and the dependencies might not be present on docs.rs's builders, nor is there yet any way to indicate what dependencies to use.

My crates should all have migrated their docs to docs.rs except ndarray. ndarray has lots of optional crate features and I want their items to be visible in the docs (and such items are marked in their doc string). It's not a big thing, but ndarray's docs are therefore technically superior outside docs.rs. It also has blue boxes for example code, which is obviously nicer to the eye :wink:

And by the way, here's a group of crates where an author has done an amazing job with non-docs.rs docs http://nalgebra.org/

sanmai-NL commented 7 years ago

Thanks for the comments. I deduce two extra issues. First, https://docs.rs should provide complete documentation and it should combine well with features and optional dependencies, and it seems not to. Secondly, sometimes https://docs.rs docs should not be the canonical variant anyway. IMO, it should be canonical by default, and this may be overridden with some configuration setting coming with the source tree.

bluss commented 7 years ago

onur/docs.rs/pull/73 can fix some of these issues

sanmai-NL commented 7 years ago

The optional configuration setting may be a string that is a URL to the canonical API docs.

briansmith commented 7 years ago

A while back, I filed https://github.com/onur/docs.rs/issues/74 to have docs.rs include the canonical link, and @onur committed at least one change towards making that happen.

https://github.com/onur/docs.rs/issues/73 will help a lot with the current main concrete problem with docs.rs. In the meantime I added a note to my documentation: “IMPORTANT: If you are reading this on docs.rs or another third-party site, you may not be seeing the complete documentation due to their limitations. Read it at https://briansmith.org/rustdoc/ring/signature/ instead.”

skade commented 7 years ago

I see the problem that projects may want their official doc pages as the canonical page. Making docs.rs the canonical URL by default might give credit where no credit is due.

A stopgap would be a noindex tag for dependencies (#41882).
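As a rough illustration of that stopgap, assuming the doc generator knows whether the crate being rendered is a dependency rather than the crate the docs were built for. `robots_meta` and the `is_dependency` flag are hypothetical, not rustdoc options:

```rust
/// Returns a robots meta tag asking search engines not to index the page
/// when it belongs to a dependency's docs, and nothing otherwise.
fn robots_meta(is_dependency: bool) -> Option<&'static str> {
    if is_dependency {
        Some("<meta name=\"robots\" content=\"noindex\">")
    } else {
        None
    }
}
```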

luser commented 6 years ago

The docs.rs stuff feels like a separate discussion that could have its own issue. I think the "every release version on doc.rust-lang.org shows up in Google search results" problem is a specific thing that's important to fix, and using <link rel="canonical"> to point at the stable docs sounds like the simplest fix.

luser commented 6 years ago

After poking around the rustdoc sources a little bit I have a concrete proposal. rustdoc already supports several options on the #[doc] attribute to control HTML output, such as html_favicon_url: https://doc.rust-lang.org/rustdoc/the-doc-attribute.html#at-the-crate-level

We should add support for a html_canonical_base_url option, and add it to the crates that wind up as part of the std documentation like: #![doc(html_canonical_base_url = "https://doc.rust-lang.org/stable/")]

It would be picked up and stored into the SharedContext along with the other attributes here: https://github.com/rust-lang/rust/blob/a85417f5938023d1491b44d94da705f539bb8b17/src/librustdoc/html/render.rs#L532

Callers of render would need to pass down the relative URL from the root for a page, possibly as a member of Page itself: https://github.com/rust-lang/rust/blob/a85417f5938023d1491b44d94da705f539bb8b17/src/librustdoc/html/layout.rs#L25

This seems to mostly be useful for Context::item, which constructs on-disk paths: https://github.com/rust-lang/rust/blob/a85417f5938023d1491b44d94da705f539bb8b17/src/librustdoc/html/render.rs#L1478

which calls Context::render_item: https://github.com/rust-lang/rust/blob/a85417f5938023d1491b44d94da705f539bb8b17/src/librustdoc/html/render.rs#L1410

which calls layout::render. This might require some variation of format::href, which currently generates URLs relative to the current page, so that it can instead generate URLs relative to the base URL: https://github.com/rust-lang/rust/blob/a85417f5938023d1491b44d94da705f539bb8b17/src/librustdoc/html/format.rs#L393

Then finally, layout::render could join the base canonical URL, if present, with the relative URL to the page and include a <link rel="canonical" href="{canonical_url}">.
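A minimal sketch of that final step, using hypothetical stand-in types rather than rustdoc's actual `Layout` and `Page` structs. The field names are assumptions, and the sketch does no HTML escaping of the URL:

```rust
/// Hypothetical stand-in for crate-level settings (not rustdoc's real type).
struct LayoutSketch {
    /// Value of the proposed `html_canonical_base_url`, if the crate set one.
    canonical_base_url: Option<String>,
}

/// Hypothetical stand-in for per-page data.
struct PageSketch<'a> {
    /// Path of this page relative to the documentation root,
    /// e.g. "std/vec/struct.Vec.html".
    path_from_root: &'a str,
}

/// Joins the base canonical URL, if present, with the page's path from the
/// root and returns the tag to place in `<head>`. Returns an empty string
/// when no base URL was configured, so existing output stays unchanged.
fn canonical_link_tag(layout: &LayoutSketch, page: &PageSketch<'_>) -> String {
    match &layout.canonical_base_url {
        Some(base) => format!(
            "<link rel=\"canonical\" href=\"{}/{}\">",
            base.trim_end_matches('/'),
            page.path_from_root.trim_start_matches('/')
        ),
        None => String::new(),
    }
}
```

Trimming the slashes on both sides keeps the join stable whether or not the configured base URL ends in `/`.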

steveklabnik commented 5 years ago

Triage: there has been some small movement; by now, this issue is getting larger and larger, and is affecting more and more people. I hope to have a plan sometime in the near-ish future; we'll see.

pietroalbini commented 4 years ago

By the way, the issue with doc.rust-lang.org has been fixed, as we now have a robots.txt in place.

jsha commented 2 years ago

I propose closing as a duplicate of https://github.com/rust-lang/docs.rs/issues/1438.

workingjubilee commented 1 year ago

Closing as this seems fixed/taken on by docs.rs