mozilla / bedrock

Making mozilla.org awesome, one pebble at a time
https://www.mozilla.org
Mozilla Public License 2.0

Exclude URLs from XML Sitemaps when they should not be indexed #15540

Closed: a-kyne closed this issue 3 days ago

a-kyne commented 4 days ago

Description

There are pages listed in the XML Sitemap that have the noindex tag; those URLs should not be included.

It would probably be best if we do not rely on URL exclusion processes that are vulnerable to human error, e.g. manually adding URLs to a “do not include” list.

Steps to reproduce

  1. Go to Google Search Console for https://www.mozilla.org/
  2. Go to Indexing > Page indexing
  3. Select “All submitted pages”
  4. Under “Why pages aren’t indexed” select “Excluded by noindex tag”
  5. Open one of the URLs listed and View Source.
  6. The source contains a robots tag with a noindex value.
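The manual check in steps 5 and 6 could be automated along these lines. This is only a minimal sketch using the Python standard library; `RobotsMetaParser` and `has_noindex` are hypothetical names for illustration, not part of bedrock, and it assumes the directive is delivered via a `<meta name="robots">` tag rather than an `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html_source):
    """Return True if the page source carries a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html_source)
    return "noindex" in parser.directives
```

Running `has_noindex` over the source of every URL in the sitemap would flag the same pages that Search Console reports, without relying on a hand-maintained list.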

Expected result

No submitted URLs are reported as excluded by a noindex tag.

Actual result

101 URLs cannot be indexed because they are excluded by a noindex tag.

Environment

n/a

stevejalim commented 4 days ago

Hi @a-kyne - what kind of priority/urgency does this issue need, please?

a-kyne commented 4 days ago

@stevejalim I was just talking Sitemap alternatives with @pmac so I'm going to close this in favor of a different approach.

a-kyne commented 4 days ago

If I can figure out how to close an issue.

janbrasna commented 4 days ago

  1. There's this list to exclude from sitemap: https://github.com/mozilla/bedrock/blob/11024c649edcd896d49c519b77d0036d7a66ee71/bedrock/settings/base.py#L478-L520

  2. Then there are robots.txt exclusions: https://github.com/mozilla/bedrock/blob/11024c649edcd896d49c519b77d0036d7a66ee71/bedrock/mozorg/templates/mozorg/robots.txt#L5-L10

  3. And inline noindex meta tags are in 26 files: https://github.com/search?q=repo%3Amozilla%2Fbedrock+noindex+language%3AHTML&type=code&l=HTML (some already covered by the above, some not…)

These do not always overlap 100% though.
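To make the mismatch concrete, here is a rough sketch of how the three sources could be compared. It assumes each source has already been reduced to a flat set of exact URL paths, which is a simplification (robots.txt rules are prefix patterns, for example). The function name, the source labels, and the paths in the usage example are all hypothetical and not taken from bedrock.

```python
def exclusion_gaps(sitemap_excluded, robots_disallowed, noindex_pages):
    """Map each path to the exclusion mechanisms that do NOT cover it.

    Paths covered by all three mechanisms are omitted from the result.
    """
    sources = {
        "sitemap_exclude_list": set(sitemap_excluded),
        "robots_txt": set(robots_disallowed),
        "noindex_tag": set(noindex_pages),
    }
    all_paths = set().union(*sources.values())

    gaps = {}
    for path in sorted(all_paths):
        missing = [name for name, paths in sources.items() if path not in paths]
        if missing:
            gaps[path] = missing
    return gaps


# Hypothetical example paths, for illustration only.
report = exclusion_gaps(
    sitemap_excluded={"/a/", "/b/"},
    robots_disallowed={"/a/"},
    noindex_pages={"/a/", "/c/"},
)
for path, missing in report.items():
    print(path, "is not covered by:", ", ".join(missing))
```

In this made-up example, a path that appears under `noindex_tag` but not under `sitemap_exclude_list` is the kind of URL the issue describes: listed in the sitemap even though it cannot be indexed.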

pmac commented 3 days ago

The not-always-overlapping bit is one thing we're trying to solve for. We're looking at potentially removing the sitemap altogether, as we're not convinced it's helping much with anything. That way we'd be able to just keep the noindex tags and have no conflict with anything else.