simonmichael / hledger_site

The repo for hledger.org, the hledger project's website.
GNU General Public License v3.0

Seemingly duplicate results in search engines like Google #66

Closed Flimm closed 1 year ago

Flimm commented 2 years ago

Let's say I'm trying to find some documentation about postings. I open Google and search for "hledger posting". These are the results that I see:

[Screenshot: Google search results for "hledger posting"]

As you can see, the first results look similar. The first three have the same title "journal manual - hledger", and they have the same breadcrumb "https://hledger.org › journal". The links of the first three results are:

(At the time of writing, the latest version of hledger is 1.25)

As you can see, these results are basically the same page, but for different versions of hledger. As you can imagine, this can make it harder to find what I was looking for.

Here are some suggestions:

simonmichael commented 2 years ago

Great suggestions, thanks. I'll work on this if no-one beats me to it.

simonmichael commented 2 years ago

I believe this is done. Changes:

A test search: https://www.google.com/search?q=hledger+posting

[Screenshots: old and new search results]

simonmichael commented 2 years ago

I think search results (even just Google's) may take a long time to clean up, and probably it would be wise to create a sitemap.xml to help that along. Generating that is not supported in released mdbook just yet; I'd welcome suggestions on what to put in it.

simonmichael commented 2 years ago

Basic sitemap created, google reindexing pending.

Flimm commented 2 years ago

That looks great! Thank you.

It occurred to me that it would be better for SEO purposes to have stable URLs that are included in the Google index. What if https://hledger.org/stable/hledger.html worked (and didn't redirect anywhere)? That way, the /stable/ URLs could collect Google juice and improve their ranking.

A URL like https://hledger.org/1.25/hledger.html collects Google juice, but at some point it gets wasted, when a new version comes out. It gets wasted when the old version gets the noindex tag. Even without that tag, it gets wasted, since all the links on the web pointing to it do not get updated to point to the new version, and the Google juice gets divided up between multiple URLs. Sorry that I didn't think of suggesting this before.

simonmichael commented 2 years ago

Good thoughts. My intent was always to have the easy https://hledger.org/hledger.html (hledger-ui.html, hledger-web.html) be the stable URLs for the manuals of the current release. IIRC previously this was done with symlinks or copies, and both URLs existed on the web. With the latest changes, /hledger.html is a redirect to /CURRENTVERSION/hledger.html, ie still a sort of "stable URL", and I think I saw today in google search console that they are correctly guessing /hledger.html as the canonical URL.

/hledger.html is missing from the new sitemap, though, so I should maybe add it there.

Though with reindexing still pending, it's a little hard to be sure what's what. I'm assuming and hoping the old manuals will disappear from google search results fairly soon, because of the sitemap I've submitted which does not include them, and/or because they now contain noindex tags.

simonmichael commented 2 years ago

I'm slightly baffled. Taking https://hledger.org/1.0/hledger.html as an example old manual page, it now has the noindex tag, and a sitemap not including it was successfully submitted. After several days its Coverage status remains "Indexed, not submitted in sitemap / URL is on Google / It can appear in Google Search results (if not subject to a manual action or removal request)". Its last crawl date is.. May 1, 2022. When I request reindexing google says it can't be indexed because of the noindex tag. Docs say not to request removal, and to rely on the sitemap or noindex tag + reindexing instead. So... keep waiting and it will happen ?

Flimm commented 2 years ago

I had a look at https://hledger.org/1.0/hledger.html . It seems that Googlebot last crawled this page on 4 May 2022. That was before the noindex changes were rolled out. So we need to wait for Googlebot to crawl this page again, or somehow prompt Googlebot to do that.

It's worthwhile distinguishing between the concepts of crawling and indexing. We want Googlebot to crawl these pages, but we don't want it to index them. You said the tool informed you that the page "can't be indexed because of the noindex tag". I think that's the message we want and expect. I'm not sure why the page is still in the Google search results if it can't be indexed.
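To make the crawl/index distinction concrete: a crawler only learns about noindex by fetching the page and reading its markup. Below is a minimal sketch of that check, assuming the old manuals use the standard `<meta name="robots" content="noindex">` tag (the exact markup on hledger.org is not shown in this thread, and the sample pages here are made up). It does not cover the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html_text):
    """Return True if the page markup asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


# Hypothetical page sources, for illustration only.
old_page = '<html><head><meta name="robots" content="noindex"></head><body>old manual</body></html>'
current_page = "<html><head><title>manual</title></head><body>current manual</body></html>"

print(has_noindex(old_page))
print(has_noindex(current_page))
```

The point of the sketch is that the tag only takes effect after the page has been fetched again, which is why a stale last-crawl date can leave an old page in the results for a while.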

I also noticed that most of the URLs in the sitemap https://hledger.org/sitemap.xml are broken. Here is the second item in the site map:

<url>
  <loc>https://hledger.org/ACHIEVEMENTS</loc>
  <lastmod>2022-05-08T18:26:14.569Z</lastmod>
</url>

If I visit https://hledger.org/ACHIEVEMENTS, I get a 404 error. A lot of the other URLs are broken, too.
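For anyone who wants to repeat this check across the whole sitemap, here is a rough sketch. It pulls every `<loc>` out of a sitemap and reports those that do not answer with HTTP 200. The function names and the `status_of` callback are my own, not part of any hledger tooling, and the status lookup is stubbed with a dict so the example runs offline; the second URL is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def broken_locs(sitemap_xml, status_of):
    """Return the sitemap URLs whose HTTP status is not 200.

    status_of is a caller-supplied function mapping a URL to a status code.
    """
    return [url for url in sitemap_locs(sitemap_xml) if status_of(url) != 200]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://hledger.org/ACHIEVEMENTS</loc></url>
  <url><loc>https://example.org/page.html</loc></url>
</urlset>"""

# Stubbed statuses: the 404 is the one reported above, the 200 is assumed.
fake_status = {
    "https://hledger.org/ACHIEVEMENTS": 404,
    "https://example.org/page.html": 200,
}

print(broken_locs(sample, fake_status.get))
```

In real use, `status_of` would issue an HTTP request for each URL (for example with `urllib.request`) instead of reading from a dict.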

simonmichael commented 2 years ago

It seemed to me that it's normal for sitemap.xml to omit the .html suffix, is that wrong ?

Flimm commented 2 years ago

I'm pretty sure that the URLs in a sitemap have to be complete URLs. You can't omit the .html suffix from the URLs if that is what the URLs contain.
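As an illustration, here is a small sketch of generating sitemap entries that keep the full path, suffix included. I don't know how the current sitemap is produced, so `build_sitemap` and the file names below are assumptions, not the site's actual build step.

```python
import xml.etree.ElementTree as ET


def build_sitemap(base_url, html_paths):
    """Build a sitemap whose <loc> values are complete URLs.

    html_paths are site-relative paths of the rendered pages, kept
    exactly as served (so the .html suffix is preserved).
    """
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for path in sorted(html_paths):
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = base_url.rstrip("/") + "/" + path.lstrip("/")
    return ET.tostring(urlset, encoding="unicode")


# Illustrative file names; the real list would come from the built site.
pages = ["hledger.html", "ACHIEVEMENTS.html"]
print(build_sitemap("https://hledger.org", pages))
```

Each `<loc>` then matches a URL that can be fetched directly, rather than a suffix-less path that returns a 404.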

Flimm commented 2 years ago

It looks like this particular URL has been removed from Google's index now, but some of the other URLs haven't been recrawled yet.

Flimm commented 2 years ago

It seems like Google has recrawled most of the URLs by now. The sitemap still contains invalid URLs.

simonmichael commented 1 year ago

One year later... @Flimm do you still see problems that should be fixed ? We want the best possible indexing, but on the other hand without spending a ton of time.

Flimm commented 1 year ago

Looks good to me! Thank you for fixing this. It definitely makes using hledger easier, as it's now easier to look up relevant documentation and discussion on Google.