mozilla / bedrock

Making mozilla.org awesome, one pebble at a time
https://www.mozilla.org
Mozilla Public License 2.0

Do not redirect users/bots to localized content based on header information #11538

Closed: slightlyoffbeat closed this issue 1 year ago

slightlyoffbeat commented 2 years ago

Description

Googlebot does not send an Accept-Language header, and Google specifically asks that websites not redirect visitors to localized content based on header information. Currently, we redirect Googlebot from any unlocalized link that it crawls to en-US content, which creates a strong indexing bias towards that locale.

Original proposal

Proposed Solution:

GET https://www.mozilla.org/ :

When the user does not declare a language, or declares a language that we do not support: Serve the “Choose your language” page as the root domain response. Redirect the URL …/locales/ to https://www.mozilla.org/ and remove the Languages link from the footer.

When the user declares a language that we support: Redirect them, with a 302, to the appropriate page.

GET other unlocalized URLs:

If the user does not declare a language, or if there is no version of that page for that language: Serve a custom 404 that offers links to all of the localized versions of that page. (Therefore, if Googlebot crawls an unlocalized URL, it gets a 404, and that non-existent URL does not appear in the index.)

When the user declares a language that we support: Redirect them, with a 302, to the appropriate page.
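To make "declares a language that we support" concrete, here is a minimal sketch of picking a locale from an Accept-Language header. It is not bedrock's actual implementation: `parse_accept_language` and `pick_locale` are hypothetical helpers, the supported-locale list is made up, and the fallback from a regional tag to its base language is an assumption about what "support" should mean.

```python
def parse_accept_language(header):
    """Return the language tags in an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip malformed q-values
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def pick_locale(header, supported):
    """Return the best supported locale for the header, or None."""
    by_lower = {locale.lower(): locale for locale in supported}
    for tag in parse_accept_language(header or ""):
        if tag in by_lower:
            return by_lower[tag]
        # Assumption: "fr-CA" may fall back to "fr", and "en" to "en-US".
        base = tag.split("-")[0]
        if base in by_lower:
            return by_lower[base]
        for locale in supported:
            if locale.lower().startswith(base + "-"):
                return locale
    return None
```

A return value of `None` covers both "no language declared" and "declared a language that we do not support", which are the cases where the proposal serves the "Choose your language" page or the custom 404 instead of a 302.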


Success Criteria

slightlyoffbeat commented 2 years ago

I spoke to Pmac: the working team for this is @robhudson and Adria. Adria will reach out and set up time to discuss and onboard.

Let's keep Pmac and me looped in on proposals and work.

alexgibson commented 2 years ago

If the /locales/ page is something we might redirect users to, then we might also consider fixing https://github.com/mozilla/bedrock/issues/6454 to make the list of locales there more complete / configurable. Right now the template is just hardcoded, and is likely already incomplete.

robhudson commented 2 years ago

I've rewritten the above logic with some extra edge cases, noting where the proposed changes differ from current behavior. Can you let me know if this looks correct?

1. Requests with NO accept-language header (primarily for Googlebot):
    a. Is the URL a non-locale-prefixed page? (e.g. /credits, /robots.txt)
        - YES: render it
        - NO: pass through
    b. Is the requested URL prefixed with a locale?
        - YES: pass through
        - NO: return 404 with supported locales for path ->
    c. Are there translations for this locale and URL?
        - YES: render it
        - NO: return 404 with supported locales for path ->  
2. Requests with accept-language header (people using browsers):
    a. Is the URL a non-locale-prefixed page? (e.g. /credits, /robots.txt)
        - YES: render it
        - NO: pass through
    b. Is the requested URL prefixed with a locale?
        - YES, user requested /{locale}/{path}/: 
            i. Are there translations for this locale and URL?
                - YES: render it
                - NO: determine best match based on language header and available translations:
                    - if any, redirect to best matching locale ->
                    - if none, return 404 with supported locales for path ->
                        - Note: We currently redirect to first available translation
        - NO, user requested /{path}/:
            i. Is there a matching locale based on available translations and user's language header?
                - YES: redirect to /{locale}/{path}/ where locale is best match ->
                - NO: return 404 with supported locales for path ->
                    - Note: We currently redirect to first available translation
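
The decision list above can be sketched as a single function. This is only an illustration under stated assumptions, not the code in bedrock: `NON_LOCALE_PATHS`, `split_locale`, and `decide` are hypothetical names, and `header_locales` is assumed to already hold supported locale codes derived from the accept-language header, best first (an empty list when there is no header).

```python
NON_LOCALE_PATHS = {"/credits", "/robots.txt"}  # the examples from the list


def split_locale(path, all_locales):
    """Split '/de/firefox/' into ('de', '/firefox/'); locale is None if absent."""
    head, _, rest = path.lstrip("/").partition("/")
    if head in all_locales:
        return head, "/" + rest
    return None, path


def decide(path, header_locales, translations, all_locales):
    """Return (status, target) for a request.

    header_locales: locales from the accept-language header, best first.
    translations: maps an unprefixed path to the locales it is translated into.
    """
    if path in NON_LOCALE_PATHS:
        return 200, path  # 1a / 2a: render as-is
    locale, bare = split_locale(path, all_locales)
    available = translations.get(bare, [])
    if locale is not None and locale in available:
        return 200, path  # 1c / 2b: a translation exists for this locale
    # No locale prefix, or no translation for the requested locale:
    # redirect to the best header match among the available translations.
    for candidate in header_locales:
        if candidate in available:
            return 302, "/" + candidate + bare
    # No header (case 1) or no match (case 2): the 404 page lists `available`.
    return 404, bare
```

With an empty `header_locales` the loop never runs, so the no-header branch and the "no match" branch share the same 404 path, which is the behavior change the notes in the list call out.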

I'll point out the "no match" case where a user (not a bot) requested a URL but there are no available translations that match the user's accept-language preferences. Currently we redirect to the first defined translation, but perhaps it is better to 404 with the page showing all available translations for the user to choose?

I'm also not sure exactly what is meant by "remove the Languages link from the footer" but perhaps we can land these changes first, then circle back around and make the languages link and select box list of languages a bit of a nicer end-user experience?

slightlyoffbeat commented 2 years ago

A quick note on the footer language link: I agree that I'd prefer to hold on footer changes for now.

This looks correct to me. I'd like to tag @pmac for an additional set of eyes on this.

Can we have analytics in place to see how many people hit a 404 for 2b, and perhaps even with info on their language header? It would be good to monitor.

pmac commented 2 years ago

Rob and I have been discussing this a lot. We're both curious about your and Adria's thoughts on the UX impact of the 404s that we would show to real users. We could choose to not change that behavior for now, but it does seem like the right thing to do if we really don't have any match for their accept-language header values. I think that case should be a small number, but good call on making sure we can measure that impact.

a-kyne commented 2 years ago

Where are we with this? Is there a blocker?

pmac commented 2 years ago

Making progress. @robhudson is working on it now. No blockers that I know of. It is a major change to how the site works though, so it's a lot of careful work.

robhudson commented 2 years ago

The work in the PR associated with this issue may also satisfy issue #9233 by providing a route to contribute on pages with incomplete translations.

a-kyne commented 2 years ago

Hi, given that there are some business issues with the redirection solution that we're not sure how to resolve right now, let's put redirection aside. However, rather than redirecting the root domain using the current localization, we will serve a "Choose your language" page at the root domain https://www.mozilla.org/ . The page will link to all of the translated home pages, and enable Googlebot to find home pages for each of the translated languages, rather than only for en-US.
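
As a rough illustration of what the "Choose your language" page needs to give a crawler, the sketch below renders one plain link per translated home page. It is an assumption-laden example, not the actual template: `language_links` is a hypothetical helper, the locale names are sample data, and the use of `hreflang` and `lang` attributes is a suggestion rather than something decided in this thread.

```python
from html import escape


def language_links(home_locales):
    """Render one <li> per translated home page, e.g. /de/ for German.

    home_locales maps a locale code to its display name.
    """
    items = []
    for code, name in sorted(home_locales.items()):
        href = "/" + code + "/"
        items.append(
            '<li><a href="{0}" hreflang="{1}" lang="{1}">{2}</a></li>'.format(
                escape(href, quote=True), escape(code, quote=True), escape(name)
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(language_links({"en-US": "English (US)", "de": "Deutsch"}))
```

Because every home page is reachable through an ordinary link, a crawler that sends no language header can still discover each localized home page from the root URL.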

robhudson commented 1 year ago

Curious if we still need this issue open? If not, can we summarize here how things ended up? /cc @a-kyne

a-kyne commented 1 year ago

I was unable to access the web server data that would tell us how many times users request unlocalized URLs other than the Privacy Policy, or indeed if there are any unlocalized URLs other than the home page that Google is requesting. Consequently I am still not sure if it would be safe to change our localization strategy for requests for non-Privacy Policy unlocalized URLs.

We have redirected the root domain, however, and that is probably enough to prevent major issues from developing. Current misdirected traffic is low: for example, about 1k impressions and 125 clicks per month on en-US URLs for search queries containing فاير.

Let's close.