Improve language detection via user-agent info for gettor

ilv commented 9 years ago

Regarding #168, I think we can follow this and this, especially this idea:

"The basic rule here is that if your language preference list contains a language tag containing a hyphen, such as fr-CH (French as spoken in Switzerland), you should consider adding an additional language tag without the hyphen, ie. fr (French) in this case, immediately after."

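The quoted rule can be illustrated with a small sketch. The helper name `add_base_languages` is hypothetical and not part of any existing code; it only shows what a "properly configured" preference list would look like.

```python
def add_base_languages(preferences):
    """Return the preference list with the bare language added right
    after each regional tag, unless the user already listed it."""
    expanded = []
    for tag in preferences:
        if tag not in expanded:
            expanded.append(tag)
        base = tag.split('-')[0]
        # Only add the bare language if the user did not list it themselves.
        if base != tag and base not in preferences and base not in expanded:
            expanded.append(base)
    return expanded


print(add_base_languages(['fr-CH', 'en']))  # ['fr-CH', 'fr', 'en']
```
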
So, a basic algorithm for this could be as follows:

1) Get the language from the User-Agent, let's say 'lc-LC'.
2) Check if 'lc-LC' matches any of the locales supported by the Tor Browser. If so, we are done.
3) Take the first two letters before the dash ('lc' in this case) and check if they match any of the locales supported by the Tor Browser. If so, we are done.
4) Else, return the default lc, e.g. en-US.
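
A minimal sketch of these steps, assuming the language has already been extracted from the request. The function name `pick_locale` and the short `supported` list are made up for illustration.

```python
def pick_locale(requested, supported, default='en-US'):
    # Exact match, e.g. 'es-ES' requested and 'es-ES' supported.
    if requested in supported:
        return requested
    # Fall back to the part before the dash, e.g. 'fr' for 'fr-CH'.
    language = requested.split('-')[0]
    if language in supported:
        return language
    # Nothing matched: use the default locale.
    return default


supported = ['en-US', 'es-ES', 'fr', 'pt-PT']  # illustrative subset only
print(pick_locale('fr-CH', supported))  # fr
print(pick_locale('pt-BR', supported))  # en-US
```

The second call shows the gap: 'pt' alone is not a supported locale, so pt-BR falls through to the default.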

For this purpose we should make some sort of mapping for en, es, pt, and zh to en-US, es-ES, pt-PT and zh-CN respectively. For instance:

We get the language from User-Agent: pt-BR
We check 2) and keep going
We check 3). Without the mapping pt => pt-PT we would have to keep going to 4). With the mapping we are able to provide a more familiar language.

This covers the case of a browser that is not configured properly, e.g. one that lists pt-BR but not pt among its preferred languages.
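
The mapping idea could look roughly like this. It is only a sketch: `BASE_TO_LOCALE` and `pick_locale_with_mapping` are hypothetical names, and the mapping holds just the four pairs mentioned above.

```python
# Bare language -> regional locale actually shipped (the pairs named above).
BASE_TO_LOCALE = {'en': 'en-US', 'es': 'es-ES', 'pt': 'pt-PT', 'zh': 'zh-CN'}


def pick_locale_with_mapping(requested, supported, default='en-US'):
    if requested in supported:
        return requested
    language = requested.split('-')[0]
    # Translate the bare language through the mapping when there is an entry,
    # otherwise try the bare language as-is.
    candidate = BASE_TO_LOCALE.get(language, language)
    if candidate in supported:
        return candidate
    return default


supported = ['en-US', 'es-ES', 'fr', 'pt-PT']  # illustrative subset only
print(pick_locale_with_mapping('pt-BR', supported))  # pt-PT
```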

Thoughts? @fpietrosanti @evilaliv3

evilaliv3 commented 9 years ago

Thanks ilv, your analysis is good; it seems to me that it has only one failure case, which is the following:

Suppose the user locale is "es", but right now TBB offers only es-ES; this way we will end up providing en-US.

I've written the following algorithm, which works like a charm on the current Tor locales, which are:

ar
de
en-US
es-ES
fa
fr
it
ko
nl
pl
pt-PT
ru
tr
vi
zh-CN

analyze it with the following inputs and tell me what you think:

  ar -> will end up providing ar
  es -> will end up providing es-ES
  es-PT -> will end up providing es-ES
  es-ES -> will end up providing es-ES
  xx -> will end up providing en-US

algorithm:

def getBestLangMatch(accept_language, supported_lcs):
    def parse_accept_language(accept_language):
        return [l.split(';')[0] for l in accept_language.replace(" ", "").split(',')]

    def language_only(lc):
        if '-' in lc:
            lc = lc.split('-')[0]

        return lc

    for lc in parse_accept_language(accept_language):
        # returns es-PT if es-PT is available (perfect match)
        for l in supported_lcs:
            if lc.lower() == l.lower():
                return l

        lc = language_only(lc)

        # returns es if asking for es-PT with
        # es-PT not available but es available
        for l in supported_lcs:
            if lc.lower() == l.lower():
                return l

        # returns es-ES if asking for es-PT with
        # es-PT and es not available but es-ES available
        for l in supported_lcs:
            if lc.lower() == language_only(l).lower():
                return l

    return 'en-US' # last resort

what do you think?

evilaliv3 commented 9 years ago

@ilv I've written a demo of the algorithm above to test it against the cases I'm expecting: https://gist.github.com/evilaliv3/5a9cd11eaa0cf60da425

any comment?
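
For readers without access to the gist, the expected cases listed earlier can be checked along these lines. This is not the gist's content; `best_match` is a condensed, hypothetical restatement of the three matching passes, and the last case (a full Accept-Language header) is an extra input added for illustration.

```python
SUPPORTED = ['ar', 'de', 'en-US', 'es-ES', 'fa', 'fr', 'it', 'ko',
             'nl', 'pl', 'pt-PT', 'ru', 'tr', 'vi', 'zh-CN']


def best_match(accept_language, supported=SUPPORTED, default='en-US'):
    for part in accept_language.split(','):
        tag = part.split(';')[0].strip().lower()  # drop any ';q=...' weight
        base = tag.split('-')[0]
        for lc in supported:  # pass 1: exact tag
            if lc.lower() == tag:
                return lc
        for lc in supported:  # pass 2: bare language
            if lc.lower() == base:
                return lc
        for lc in supported:  # pass 3: any locale sharing the language
            if lc.lower().split('-')[0] == base:
                return lc
    return default  # last resort


cases = {
    'ar': 'ar',
    'es': 'es-ES',
    'es-PT': 'es-ES',
    'es-ES': 'es-ES',
    'xx': 'en-US',
    'pt-BR,pt;q=0.8,en;q=0.5': 'pt-PT',
}
for header, expected in cases.items():
    assert best_match(header) == expected, (header, best_match(header))
print('all cases pass')
```

Like the original, this sketch takes the header entries in the order given and ignores the q weights.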

ilv commented 9 years ago

Great @evilaliv3, it seems that you have covered all the cases :) The algorithm works quite well, I've tested it with some extra inputs and all is good. Given the few locales supported by Tor Browser I think this will work perfectly fine.

tor2web / Tor2web

Improve language detection via user-agent info for gettor #217