tor2web / Tor2web

Tor2web is HTTP proxy software that enables access to Tor Hidden Services by means of common web browsers
https://www.tor2web.org
GNU Affero General Public License v3.0

Improve language detection via user-agent info for gettor #217

Closed ilv closed 9 years ago

ilv commented 9 years ago

Regarding #168, I think we can follow this and this, especially this idea:

"The basic rule here is that if your language preference list contains a language tag containing a hyphen, such as fr-CH (French as spoken in Switzerland), you should consider adding an additional language tag without the hyphen, ie. fr (French) in this case, immediately after."

So, a basic algorithm for this could be as follows:

  1. Get language from User-Agent, let's say 'lc-LC'
  2. Check if 'lc-LC' matches any of the locales supported by the Tor Browser. If so, we are done.
  3. Take the first two letters before the dash ('lc' in this case) and check if they match any of the locales supported by the Tor Browser. If so, we are done.
  4. Else, return the default locale, e.g. en-US.

For this purpose we should make some sort of mapping from en, es, pt, and zh to en-US, es-ES, pt-PT and zh-CN respectively.

This covers the case of a browser that is not configured properly, e.g. one that has pt-BR and pt as preferred languages.
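
A minimal sketch of the four steps and the mapping described above, offered as an illustration and not as the gettor implementation. The names `SUPPORTED`, `LANGUAGE_TO_LOCALE`, `DEFAULT_LOCALE` and `pick_locale` are hypothetical, and the locale list is assumed to be the one Tor Browser ships at the time of this discussion:

```python
# Locales assumed to be offered by Tor Browser (an assumption for this sketch).
SUPPORTED = ["ar", "de", "en-US", "es-ES", "fa", "fr", "it", "ko",
             "nl", "pl", "pt-PT", "ru", "tr", "vi", "zh-CN"]

# Hypothetical mapping from a bare language code to the regional
# locale that is actually available, as proposed above.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "pt": "pt-PT",
    "zh": "zh-CN",
}

DEFAULT_LOCALE = "en-US"


def pick_locale(requested, supported=SUPPORTED):
    """Return the supported locale for one requested tag, e.g. 'pt-BR'."""
    # Index the supported locales by their lower-cased form so that
    # matching is case-insensitive but the original spelling is returned.
    by_lower = {lc.lower(): lc for lc in supported}

    # Step 2: exact match on the full tag.
    tag = requested.strip().lower()
    if tag in by_lower:
        return by_lower[tag]

    # Step 3: fall back to the language part before the dash,
    # translating it through the mapping when one exists.
    language = tag.split("-")[0]
    candidate = LANGUAGE_TO_LOCALE.get(language, language)
    if candidate.lower() in by_lower:
        return by_lower[candidate.lower()]

    # Step 4: nothing matched, so return the default.
    return DEFAULT_LOCALE
```

With this sketch, `pick_locale("pt-BR")` falls through step 2, reduces the tag to `pt` in step 3, and resolves to `pt-PT` through the mapping.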

Thoughts? @fpietrosanti @evilaliv3

evilaliv3 commented 9 years ago

Thanks ilv, your analysis is good; it seems to me that it has only one failure case, which is the following:

Suppose the user locale is "es", but right now TBB offers only es-ES; this way we will end up providing en-US.

I've written the following algorithm, which works like a charm on the current Tor locales, which are:

ar
de
en-US
es-ES
fa
fr
it
ko
nl
pl
pt-PT
ru
tr
vi
zh-CN

analyze it with the following inputs and tell me what you think:

  ar -> will end up providing ar
  es -> will end up providing es-ES
  es-PT -> will end up providing es-ES
  es-ES -> will end up providing es-ES
  xx -> will end up providing en-US

algorithm:

def getBestLangMatch(accept_language, supported_lcs):
    def parse_accept_language(accept_language):
        return [l.split(';')[0] for l in accept_language.replace(" ", "").split(',')]

    def language_only(lc):
        if '-' in lc:
            lc = lc.split('-')[0]

        return lc

    for lc in parse_accept_language(accept_language):
        # returns es-PT if es-PT is available (perfect match)
        for l in supported_lcs:
            if lc.lower() == l.lower():
                return l

        lc = language_only(lc)

        # returns es if asking for es-PT with
        # es-PT not available but es available
        for l in supported_lcs:
            if lc.lower() == l.lower():
                return l

        # returns es-ES if asking for es-PT with
        # es-PT and es not available but es-ES available
        for l in supported_lcs:
            if lc.lower() == language_only(l).lower():
                return l

    return 'en-US' # last resort

what do you think?
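
The `parse_accept_language` helper above keeps the tags in the order they appear in the header and drops the `;q=` weights. Browsers usually list languages in descending preference, so that is often enough. As a hedged sketch, assuming one wanted to honour the weights explicitly, a variant could sort the tags by their q value. This variant is hypothetical and not part of the algorithm above:

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag or tag == "*":
            continue  # skip empty entries and the wildcard

        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"

        if quality > 0:
            # Negate the weight so that a plain ascending sort puts the
            # highest weight first; position breaks ties in header order.
            weighted.append((-quality, position, tag))

    return [tag for _, _, tag in sorted(weighted)]
```

The returned list could be iterated in place of the one produced by the helper inside `getBestLangMatch`; the matching loops would stay unchanged.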

evilaliv3 commented 9 years ago

@ilv I've written a demo of the algorithm above to test it against the cases I'm expecting: https://gist.github.com/evilaliv3/5a9cd11eaa0cf60da425

any comment?

ilv commented 9 years ago

Great @evilaliv3, it seems that you have covered all the cases :) The algorithm works quite well, I've tested it with some extra inputs and all is good. Given the few locales supported by Tor Browser I think this will work perfectly fine.