vercel / next.js

The React Framework
https://nextjs.org
MIT License
124.99k stars 26.7k forks source link

locale negotiation is incorrect for i18n routing #18676

Open longlho opened 3 years ago

longlho commented 3 years ago

Bug report

Describe the bug

If I send Accept-Language: "fr-XX,en" and I have available locales fr, en, it should give me fr instead of en (current behavior)

To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Should be routed to fr

Screenshots

If applicable, add screenshots to help explain your problem.

System information

Additional context

Add any other context about the problem here.

longlho commented 3 years ago

this might be something formatjs can provide actually

longlho commented 3 years ago

@Timer I believe u can use the ResolveLocale abstract operation from ECMA402 (spec). We do have an implementation in @formatjs/ecma402-abstract: https://codesandbox.io/s/affectionate-chatelet-nvgl7?file=/src/App.tsx

ResolveLocale(
    ["fr", "en"],
    ["fr-XX", "en"],
    { localeMatcher: "best fit" },
    [],
    {},
    () => "en"
)

should yield

{
    "locale":"fr",
    "dataLocale":"fr"
}
arguiot commented 3 years ago

I have the same issue here... very very annoying.

WsCandy commented 3 years ago

I've just encountered this in a real world scenario, I have Italian set up as an it subpath. Desktop browsers set to Italian seem to match it. Safari on iOS only sends it-it in their headers, this results on the page falling back to English.

Here's a link: https://i18n.wscandy.vercel.app/

arguiot commented 3 years ago

@ijjk You have assigned yourself to this issue, but to save you time, I tried to understand how NextJS took into account the Accept-Language header, and realized that this part was managed by the hapi/accept module: https://github.com/vercel/next.js/blob/master/packages/next/next-server/server/next-server.ts#L328

So, I've written a small script that could replace the accept.language method:

const regex = /((([a-zA-Z]+(-[a-zA-Z0-9]+){0,2})|\*)(;q=[0-1](\.[0-9]+)?)?)*/g;

function parse(al){
    const strings = (al || "").match(regex);
    return strings.map(m => {
        if(!m){
            return;
        }

        const bits = m.split(';');
        const ietf = bits[0].split('-');
        const hasScript = ietf.length === 3;

        return {
            code: ietf[0],
            script: hasScript ? ietf[1] : null,
            region: hasScript ? ietf[2] : ietf[1],
            quality: bits[1] ? parseFloat(bits[1].split('=')[1]) : 1.0
        };
    }).filter(r => r).sort((a, b) => b.quality - a.quality);
}

function pick(supportedLanguages, acceptLanguage) {
    const parsedAcceptedLanguage = parse(acceptLanguage);

    const supported = supportedLanguages.map(support => {
        const bits = support.split('-');
        const hasScript = bits.length === 3;

        return {
            code: bits[0],
            script: hasScript ? bits[1] : null,
            region: hasScript ? bits[2] : bits[1]
        };
    });

    for (let i = 0; i < parsedAcceptedLanguage.length; i++) {
        const lang = parsedAcceptedLanguage[i];
        const langCode = lang.code.toLowerCase();
        const langRegion = lang.region ? lang.region.toLowerCase() : lang.region;
        const langScript = lang.script ? lang.script.toLowerCase() : lang.script;
        let possible = []
        for (let j = 0; j < supported.length; j++) {
            const supportedCode = supported[j].code.toLowerCase();
            const supportedScript = supported[j].script ? supported[j].script.toLowerCase() : supported[j].script;
            const supportedRegion = supported[j].region ? supported[j].region.toLowerCase() : supported[j].region;
            if (langCode === supportedCode) {
                possible.push({ lang: supportedLanguages[j], supportedScript, supportedRegion });
            }
        }
        if (possible.length > 0) {
            const regionsFilter = possible.filter(({ supportedRegion }) => supportedRegion == langRegion)
            if (regionsFilter.length > 0) {
                const scriptFilter = regionsFilter.filter(({ supportedScript }) => supportedScript == langScript)
                if (scriptFilter.length > 0) {
                    return scriptFilter[0].lang
                }
                return regionsFilter[0].lang
            } else {
                return possible[0].lang
            }
        }
    }

    return null;
}

Which, once implemented/imported in the server (line 328) would resolve into:

pick(i18n.locales, req.headers['accept-language'])

I've tested my script on multiple cases, and I found it always took the right option (but I encourage to try it yourself). The only thing is that it works better by adding general language code before region/script specific languages: for example, to make sure en is selected it has to be placed first:

const languages = ["en", "en-US", "en-GB"]
pick(languages, "en-AU,en-US;q=0.9,en;q=0.8")
longlho commented 3 years ago

@arguiot unfortunately your locale negotiation algorithm is not correct. There are currently 2 algorithms:

RFC4647 is pretty rudimentary while UTS35 is more sophisticated and handle legacy aliases (e.g zh-TW -> zh-Hant-TW). libicu uses UTS35 and that's also what formatjs simulate as well.

arguiot commented 3 years ago

@longlho Which part is wrong exactly? The links you gave aren't algorithm at all, they are specifications (which is quite different). Currently, the problem is not about parsing Accept-Language header or selecting the most relevant locale, but a bit of both.

I'm not saying my algorithm is perfect, because it's not the case and not the goal. The problem with specifications is that not everyone follows them. The language picker has to be very flexible to make sure that it works for everyone.

There's room for improvement (so if you want to improve my work or someone else's work, please do). But I think it's better to quickly fix the issue and improve the language detection overtime. I believe it's always better to have something pretty reliable right now that will get better than having something absolutely perfect but in 3 months.

longlho commented 3 years ago

I'm not sure you understand locale negotiation. You need language matching data from CLDR to resolve them correctly. Just parsing locales and doing matching based on lang/region/script is not enough. The 2 algorithms are described in the 2 specs if you read them. "Not everyone following them" because they don't really understand its complexity and choose to just hack together solutions that work in the 20 tests they have.

I don't recommend just hacking a solution. Either use a library that follows the algorithm or implement the algorithm itself.

Formatjs already follows the spec so either use that or libicu

arguiot commented 3 years ago

I probably don't understand locale negotiation as well as you. I would like to understand why parsing locales and doing matching based on lang/region/script is not enough in our case.

The problem with what you're proposing is that it's not related to Accept-Language headers at all. Which basically means that in the ECMA402 implementation q weightings aren't respected.

Moreover, we would have to parse the Accept-Language ourself. I already solved the problem in my implementation.

I agree on the fact that it's better to use FormatJS, but we first need to solve the q weightings problem first.

Finally, the ResolveLocale functions makes bad decisions:

ResolveLocale(
            ["en-CA", "fr-CA"],
            ["fr-FR"],
            { localeMatcher: "best fit" },
            [],
            {},
            () => "en"
)

In that case, it should return fr-CA because it's the closest locale, but it returns en.

longlho commented 3 years ago

Q weightings determines the order of the requested locale array.

And in your example, yes, it should return "en". You're assuming if you understand a language in a certain region then you understand that in all regions, which is not true for traditional vs simplified Chinese, or latin american spanish vs spain spanish.

Using lang/region/script is not enough. Because zh-TW matches with zh-Hant, not zh-Hans or zh and that data lives in CLDR.

arguiot commented 3 years ago

@longlho I know what Q weightings are, but does the order matter in the ResolveLocale function?

This is a very tricky point, and I think we should let the developer decide. Maybe by letting the developer use its own resolver in next.config.js? And make the ECMA402 resolver the default one.

The problem is that not every browser will send the header with the region and the code: Chrome, will do it but Safari will only send en-US for example.

For global websites, having a perfect language resolver matters a lot. But for country specific website, like in Canada for example, only the language code matters.

I would like to have your opinion on this proposal. For me I think it would solve the problem (and close our little debate :smile:).

longlho commented 3 years ago

The order absolutely matters in ResolveLocale. And tbh you should NOT let developers control the resolver bc it requires fairly extensive knowledge on how locale negotiation & CLDR data works, in conjunction with things like upgrading libicu/CLDR/unicode versions. Unless you have a dedicated team for it, it's not something that an average dev wants to deal with.

The problem is that not every browser will send the header with the region and the code: Chrome, will do it but Safari will only send en-US for example.

That's not entirely correct. Locale preferences are determined by both browser user settings and OSes. It contains more than just lang/script/region. If u read the UTS35 LDML spec it contains preferences like calendar, numbering systems and such (e.g ar-nu-latn is Arabic using Latin numbering system instead of Arabic ones). Calendar/numbering systems can be pulled from the OS by the browser. Browsers can choose to send those or not, due to potential fingerprinting issue.

For global websites, having a perfect language resolver matters a lot. But for country specific website, like in Canada for example, only the language code matters.

Canada has 2 official languages: English & French. CA also uses Celsius instead of Fahrenheit. There are a lot of differences in en vs en-CA (number grouping, grouping symbol, currencies...). I guess I'm not sure what you meant here.

Having worked with large engineering org I have not seen the need to have "custom" i18n functionalities that are not encompassed by Unicode or ICU due to high learning curve and complexity, so providing something standard out of the box is the norm.

arguiot commented 3 years ago

I must admit that you convinced me. I created a Pull Request (#19460) based on your code. It would be great if you could review the code and make sure everything is working properly.

flayks commented 3 years ago

Experiencing the same issue on a Next 10 sites using built-in localization. The client using iOS Safari (I can reproduce it on my iPad) is not redirected to /fr but keeps seeing the content of the site with the default language (en). Works on any other browser though.

longlho commented 3 years ago

~No "en" is correct. en-US is just an alias for en~

Sorry I misread your function, turns out it was using an older version of ResolveLocale, the latest version should resolve to zh-CN. I've updated the codesandbox as well.

myriky commented 3 years ago

same issue here.

i18n: {
    locales: ['ko', 'en', 'fr', 'nl'],
    defaultLocale: 'en',
  }

i'm prefer using in ko-kr. each browser sent difference format in Accept-Language Header like this.

os x, chrome ✅ Accept-Language: ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7 스크린샷 2021-01-01 오후 6 46 47

os x, safari ❌ Accept-Language: ko-kr 스크린샷 2021-01-01 오후 6 46 37

ios, chrome ❌ Accept-Language: ko-kr 스크린샷 2021-01-01 오후 6 50 19

windows 10, edge ✅ Accept-Language: ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7 KakaoTalk_Photo_2021-01-01-18-57-07

windows 10, chrome ✅ Accept-Language: ko,en;q=0.9,en-US;q=0.8 KakaoTalk_Photo_2021-01-01-18-57-03

windows 10, internet explorer 11 ❌ Accept-Language: ko-kr KakaoTalk_Photo_2021-01-01-18-57-10

Except OS X Chrome, all browsers are sending not include region code.

Temporally, I put a 'ko-KR' also.

i18n: {
    locales: ['ko','ko-KR','en', 'fr', 'nl'],
    defaultLocale: 'en',
  }

but really annoying..

myriky commented 3 years ago

Just pick the prefix language letter and match it. please.

'en-US' => 'en' 'en' => 'en' 'en-WHATEVER' => 'en' 'en-MARS🚀' => 'en'

'zh-TW'=> 'zh' 'zh-HK'=> 'zh' 'zh-WHATEVER' => 'zh'

'ko-KR' => 'ko' 'ko-KP' => 'ko' 'ko-WHATEVER' => 'ko'

longlho commented 3 years ago

Like I mentioned, zh-TW -> zh would not be correct. zh is alias for Traditional Chinese where zh-TW or zh-HK is Simplified Chinese.

myriky commented 3 years ago

any updates?

j4nos commented 3 years ago

any updates?

traviscrist commented 3 years ago

Is this being fixed? I think I'm running into the same issue with Safari on iOS when its set to spanish its not picking up the "es" locale. Works just fine in Chrome though.

anialamo commented 3 years ago

Is this being fixed? I think I'm running into the same issue with Safari on iOS when its set to spanish its not picking up the "es" locale. Works just fine in Chrome though.

Hi! Do you know, there is a solution for this issue? script? Nexts.js new version? A lot of thanks!

hohl commented 3 years ago

Given that it seems that this issue isn't gonna resolved anytime soon: did somebody at least find a workaround? Some way to hijack the locale negotiation routine maybe?

traviscrist commented 3 years ago

I wish, I just ended up adding a language button to the bottom of my site which lets you flip from english to spanish.

Its amazing there is no work being done to fix this, if this was more than a personal site i’d consider is a major issue.

longlho commented 3 years ago

@hohl technically it is possible if you don't use built in next.js locale routing and just read directly from headers to negotiate

hgezim commented 3 years ago

I already solved the problem in my implementation.

@arguiot How did you solve it?

hgezim commented 3 years ago

Issue also presents itself in Firefox. I surely thought this was me not using Next.js properly. It's super odd that this bug has been allowed to persist since i18n was implemented in Next.js.

chr4ss12 commented 3 years ago

facing same issue here,

@longlho do you know what happened with your PR? https://github.com/vercel/next.js/pull/19552 that looks good to me, I'll try to patch-package it and that should be good workaround for now

longlho commented 3 years ago

the PR is pretty old but that's the general idea. If you're using vercel the issue is upstream routing at the CDN level so that patch won't help.

neb-b commented 3 years ago

I can't tell if this has been reported yet, but it seems this is happening specifically with dynamic routes. If you create a new next.js project with a single pages/index.js file it works correctly. But as soon as I add a pages/[id].js it starts reporting the locale as whatever defaultLocale I have in my next.config.js

Since this seems to be related to the routing, it would be nice to have an option to use localeDetection: true and something like localeRedirection: false

olivierpascal commented 2 years ago

Any updates?

topched commented 2 years ago

Safari does NOT work for us pushing the locale into router.push it always goes back to english. Chrome works just fine. using next v 11.1.2

olivierpascal commented 2 years ago

Here is a simple expression of the problem:

$ curl 'https://mywebsite' -H 'Accept-Language: fr-FR'
Redirecting to /en (308)

$ curl 'https://mywebsite' -H 'Accept-Language: fr'
Redirecting to /fr (307)
timuric commented 2 years ago

One possible workaround is to include all locales, like so:

const es = [
   "es",
   "es-AR",
   "es-BO",
   "es-CL",
   "es-CO",
   "es-CR",
   "es-DO",
   "es-EC",
   "es-ES",
   "es-GT",
   "es-HN",
   "es-MX",
   "es-NI",
   "es-PA",
   "es-PE",
   "es-PR",
   "es-PY",
   "es-SV",
   "es-UY",
   "es-VE"
];

const fr = ["fr", "fr-BE", "fr-CA", "fr-CH", "fr-FR", "fr-LU", "fr-MC"];

const de = ["de", "de-AT", "de-CH", "de-DE", "de-LI", "de-LU"];

const nl = ["nl", "nl-BE", "nl-NL"];

It creates a massive overhead for static page generation though.

I have tested multiple production next websites and unfortunately all of them have this flaw.

It is quite sad that this is a year old issue, I guess it is under reported because it is hard to spot and real users most likely would not report such UX issue.

pozylon commented 2 years ago

@timuric Our bloody workaround was to use _app, add a small hook so there is client-side redirection from / to /de when accpt-language resolves to de* and route is /. But of course it has major downsides

timuric commented 2 years ago

@pozylon

@timuric Our bloody workaround was to use _app, add a small hook so there is client-side redirection from / to /de when accpt-language resolves to de* and route is /. But of course it has major downsides

Haha, we also tried that, but unfortunately it is not bullet proof :(

So far adding all possible locales does the job, but as I mentioned there is a massive overhead that can be a deal breaker for some static paths with many permutations.

pozylon commented 2 years ago

@timuric We use SSG for a webshop and I'd wait 2h for a build if I did that haha ^^

Well it's all crap we need a next-native solution, I'd prefer beeing able to define a custom locale resolver as a function:

{
  i18n: {
     resolveLocale(ctx) { return ctx.req.headers?["X-Startreck"] ? "klingon" : null }
  }
}
longlho commented 2 years ago

you guys can use https://formatjs.io/docs/polyfills/intl-localematcher to manually match locales

ricamgar commented 2 years ago

any update on this?

olivier-voutat commented 2 years ago

I had this same problem. My workaround is to do my _middleware file as this:

import { NextRequest, NextResponse } from 'next/server'

import { ACCEPTED_LOCALES, DEFAULT_LANGUAGE } from '../core/constants/languages'

const PUBLIC_FILE = /\.(.*)$/

export function middleware(req: NextRequest) {
    const shouldHandleLocale =
        !PUBLIC_FILE.test(req.nextUrl.pathname) &&
        !req.nextUrl.pathname.includes('/api/') &&
        req.nextUrl.locale === 'default'

    if (shouldHandleLocale) {
        const language = req.cookies['NEXT_LOCALE'] ?? getBrowserLanguage(req) ?? DEFAULT_LANGUAGE
        const url = req.nextUrl.clone()
        url.pathname = `/${language}${req.nextUrl.pathname}`
        return NextResponse.redirect(url)
    }

    return undefined
}

// Chrome automaticaly adds fallback to generic languages
// so this is not called when it is the used browser
const getBrowserLanguage = (req: NextRequest) => {
    return req.headers
        .get('accept-language')
        ?.split(',')
        .map((i) => i.split(';'))
        ?.reduce(
            (ac: { code: string; priority: string }[], lang) => [
                ...ac,
                { code: lang[0], priority: lang[1] }
            ],
            []
        )
        ?.sort((a, b) => (a.priority > b.priority ? -1 : 1))
        ?.find((i) => ACCEPTED_LOCALES.includes(i.code.substring(0, 2)))
        ?.code?.substring(0, 2)
}

My ACCEPTED_LOCALES variable = ['en', 'fr', 'pt']

olivier-voutat commented 2 years ago

I had this same problem. My workaround is to do my _middleware file as this:

import { NextRequest, NextResponse } from 'next/server'

import { ACCEPTED_LOCALES, DEFAULT_LANGUAGE } from '../core/constants/languages'

const PUBLIC_FILE = /\.(.*)$/

export function middleware(req: NextRequest) {
  const shouldHandleLocale =
      !PUBLIC_FILE.test(req.nextUrl.pathname) &&
      !req.nextUrl.pathname.includes('/api/') &&
      req.nextUrl.locale === 'default'

  if (shouldHandleLocale) {
      const language = req.cookies['NEXT_LOCALE'] ?? getBrowserLanguage(req) ?? DEFAULT_LANGUAGE
      const url = req.nextUrl.clone()
      url.pathname = `/${language}${req.nextUrl.pathname}`
      return NextResponse.redirect(url)
  }

  return undefined
}

// Chrome automaticaly adds fallback to generic languages
// so this is not called when it is the used browser
const getBrowserLanguage = (req: NextRequest) => {
  return req.headers
      .get('accept-language')
      ?.split(',')
      .map((i) => i.split(';'))
      ?.reduce(
          (ac: { code: string; priority: string }[], lang) => [
              ...ac,
              { code: lang[0], priority: lang[1] }
          ],
          []
      )
      ?.sort((a, b) => (a.priority > b.priority ? -1 : 1))
      ?.find((i) => ACCEPTED_LOCALES.includes(i.code.substring(0, 2)))
      ?.code?.substring(0, 2)
}

My ACCEPTED_LOCALES variable = ['en', 'fr', 'pt']

My test results with this workaround (en default language, browsers languages in this order: fr-CH, pt-BR, en-US)

Firefox

http://localhost:3000/ => http://localhost:3000/fr/ - language found in getBrowserLanguage function http://localhost:3000/test-localize/ => http://localhost:3000/fr/test-localize/ - language found in getBrowserLanguage function

Chrome

http://localhost:3000/ => http://localhost:3000/fr/ - language fr sent by chrome, no call to getBrowserLanguage http://localhost:3000/test-localize/ => http://localhost:3000/fr/test-localize/ - language found in getBrowserLanguage function

xdamman commented 9 months ago
$ curl 'https://mywebsite' -H 'Accept-Language: fr-FR'
Redirecting to /en (308) 🫣

$ curl 'https://mywebsite' -H 'Accept-Language: fr'
Redirecting to /fr (307) 🙂

$ curl 'https://mywebsite/path' -H 'Accept-Language: fr'
No redirect (uses default locale)  🫣
Expected: redirecting to /fr/path
Lvdwardt commented 4 months ago

Hi, any updates on this?

Edit by maintainer bot: Comment was automatically minimized because it was considered unhelpful. (If you think this was by mistake, let us know). Please only comment if it adds context to the issue. If you want to express that you have the same problem, use the upvote 👍 on the issue description or subscribe to the issue for updates. Thanks!