kubeflow / website

Kubeflow's public website
Creative Commons Attribution 4.0 International

PR/historical branches are getting indexed by Google #3645

Open chalin opened 7 months ago

chalin commented 7 months ago

Originally posted by @thesuperzapper in https://github.com/kubeflow/website/issues/3628#issuecomment-1841612784:

@chalin Also, all our PR/historical branches are getting indexed by Google, we should fix that at the same time as this PR.

The goals would be:

  1. The main www.kubeflow.org site should be indexed
  2. All PR deploy previews (deploy-preview-XXXX--competent-brattain-de2d6d.netlify.app) should NOT be indexed
  3. All other versioned sites, such as v1-7-branch.kubeflow.org, should NOT be indexed:
    • (these are just CNAME records pointing to the branch domains like v1-7-branch--competent-brattain-de2d6d.netlify.app)
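
As a rough illustration of the three goals, here is a minimal Python sketch. The function name `should_be_indexed` is hypothetical, and treating `www.kubeflow.org` as the only indexable host is an assumption read directly from goal 1 (the bare apex domain is not covered by the goals above).

```python
def should_be_indexed(hostname: str) -> bool:
    """Return True only for the main production site (goal 1).

    Assumption: everything else, including Netlify deploy previews
    (goal 2) and versioned subdomains such as v1-7-branch.kubeflow.org
    (goal 3), should stay out of search results.
    """
    # Normalize case and drop a trailing dot from fully qualified names.
    normalized = hostname.lower().rstrip(".")
    return normalized == "www.kubeflow.org"
```

A check like this could sit in a small audit script that walks the known hostnames and compares the expectation with what each site actually serves.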

I believe your changes here achieve 2: you are setting -e dev in the hugo command, and because that environment is not "production", Docsy adds <meta> noindex tags.
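
One way to confirm that a built page carries the noindex tag is to parse its HTML and look for a robots meta element. The sketch below uses only the Python standard library. It assumes the tag has the usual form `<meta name="robots" content="noindex, nofollow">`, and the names `RobotsMetaParser` and `has_noindex` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. bare attributes), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Running `has_noindex` against the rendered HTML of a deploy preview and of the production site would show whether goals 1 and 2 both hold.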

We need to be careful about 1. Are you 100% confident that not setting -e production or HUGO_ENV=production is safe?

To achieve 3, we could set HUGO_ENV to dev under [context.branch-deploy.environment], but it will probably propagate faster if we use a robots.txt disallow on those domains (otherwise, the <meta> tags won't take effect until Google next crawls each page).

chalin commented 7 months ago

To achieve 3, we could set HUGO_ENV to dev under [context.branch-deploy.environment], but it will probably propagate faster if we use a robots.txt disallow on those domains (otherwise, the <meta> tags won't take effect until Google next crawls each page).

AFAIK, what you propose won't work. I've had to work through a similar issue for another CNCF project with multiple versions of the docs being indexed. In my experience, you'll need to change each old-version branch individually (configuring it to emit noindex, nofollow as appropriate for the branch) and have it rebuilt and redeployed.

Btw, you can't use robots.txt to prevent domains from being indexed -- see https://developers.google.com/search/docs/crawling-indexing/robots/intro, which notes that robots.txt is not a mechanism for keeping a web page out of Google.
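
To make the crawl-versus-index distinction concrete, here is a small Python sketch using the standard library's `urllib.robotparser`. It only shows what a robots.txt disallow controls: whether a crawler may fetch a URL. The helper name `crawl_allowed` and the sample URL path are hypothetical. If fetching is blocked, the crawler never sees the page's noindex <meta> tag, and the URL can still end up indexed if other sites link to it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks all crawling for every user agent.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def crawl_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With `BLOCK_ALL`, `crawl_allowed` returns False for any page on the domain, so a noindex tag added to the old branches would go unread there. That is why the two mechanisms should not be combined on the same site.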

/cc @nate-double-u

chalin commented 7 months ago

As I mentioned elsewhere, I'm OOO, but I'll be glad to help with this in the new year.

thesuperzapper commented 7 months ago

@chalin It's possible that, if the Netlify configs for all branches are defined in master (rather than in the branches themselves), as discussed here https://github.com/kubeflow/website/pull/3628#discussion_r1416262439, we might only need to update master and then trigger a re-deploy of the older Netlify branches.

(However, I think the very new version of Hugo running in master will probably break our really old Docsy versions, and the deploy might fail.)