python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License

Provide a list of current ISO 3166 country codes? #904

Open · jenstroeger opened this issue 2 years ago

jenstroeger commented 2 years ago

More of a feature question/suggestion: would it be possible to generate and provide a list of ISO 3166-2 country codes? A quick glance at common/main/*.xml (from cldr-common.zip) suggests that the 2-letter codes are provided as

ldml/identity/language@type

I’m just not familiar enough with the standard, but the Country/Region (Territory) Names documentation mentions that they’re related:

Country and region names (referred to as Territories in the Survey Tool) may be used as part of Language/Locale Names, […]

Or am I misinterpreting it?

The Babel Core already has a few hardwired country codes for language aliasing:

https://github.com/python-babel/babel/blob/33d1dd738af8da30c7f24efe6b76ff8f56d154fc/babel/core.py#L79-L88

Basically, what I’m suggesting is something like python-iso3166 but generated from the CLDR data, and perhaps as simple strings.

akx commented 2 years ago

Sounds like a good idea. If you feel like writing a PR to implement this, feel free to!

jenstroeger commented 2 years ago

Thanks @akx, happy to create a PR if you don’t mind a little guidance:

akx commented 2 years ago
  1. Sure, just a frozenset of ISO 3166-2s should be fine to begin with, I think? (A list is needlessly ordered.)
  2. It shouldn't be hard-coded (in fact LOCALE_ALIASES probably shouldn't be either), especially since you'd load it from the CLDR. The API would probably be get_iso3166_2_country_codes() or similar, and it would call get_global() to load the data from the pickle files.
  3. The filenames are locale identifiers, not countries. If there is a definitive list of countries within the CLDR data (I don't have a browsable copy at hand right now), then use that by all means.
  4. Yes, to format the data to be pickled into global.dat, and babel/core.py.
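
To make point 2 concrete, a minimal sketch of what such an accessor might look like; the function name follows the suggestion above, and the 'iso_3166_country_codes' key in global.dat is an assumption about what the import script would store, not existing Babel behaviour:

from babel.core import get_global

def get_iso3166_2_country_codes() -> frozenset:
    # Hypothetical sketch, not an existing Babel API: assumes import_cldr.py
    # stores the collected codes under an 'iso_3166_country_codes' key.
    return frozenset(get_global('iso_3166_country_codes'))
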
jenstroeger commented 2 years ago
  1. Sure, just a frozenset of ISO 3166-2s should be fine to begin with, I think? (A list is needlessly ordered.)

…and performs worse when searched. You just beat me to making the change from list[str] to set[str], but frozenset works as well 👍🏼
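
(For illustration only, with made-up data: membership tests against a set are hash lookups, while a list has to be scanned.)

codes_list = ['AD', 'AE', 'AF', 'ZW']
codes_set = frozenset(codes_list)
assert 'ZW' in codes_set   # hash lookup, average O(1)
assert 'ZW' in codes_list  # linear scan, O(n)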

  2. It shouldn't be hard-coded (in fact LOCALE_ALIASES probably shouldn't be either), especially since you'd load it from the CLDR. The API would probably be get_iso3166_2_country_codes() or similar, and it would call get_global() to load the data from the pickle files.

Thanks!

  3. The filenames are locale identifiers, not countries. If there is a definitive list of countries within the CLDR data (I don't have a browsable copy at hand right now), then use that by all means.

I’ve not found a list of country codes, but thought of iterating over the main/*.xml files and pulling out their ldml/identity/language/@type values, which should be the country codes.

  4. Yes, to format the data to be pickled into global.dat, and babel/core.py.

Thanks!

jenstroeger commented 2 years ago

I think using ldml/identity/territory/@type (from the files in main/*.xml) doesn’t quite work.

For reference, I downloaded ISO_3166-1_alpha-2.html from Wikipedia and extracted the set of “Officially assigned” country codes using the XPath expression

//table[@class="wikitable"][2]//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]//span[@class="monospaced"]/text())

which returns 249 elements:

['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BV', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GS', 'GT', 'GU', 'GW', 'GY', 'HK', 'HM', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TF', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
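
For reproducibility, the extraction can be sketched with lxml roughly as follows; the local filename and variable name are illustrative, not the exact code used:

from lxml import html

doc = html.parse('ISO_3166-1_alpha-2.html')  # the saved Wikipedia page
wikipedia_set = set(
    doc.xpath(
        '//table[@class="wikitable"][2]'
        '//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]'
        '//span[@class="monospaced"]/text()'
    )
)
print(len(wikipedia_set))  # 249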

To crosscheck, I saved the page DOM from the iso.org website and extracted country codes from that table:

//table[@class="grs-grid"]//td[@class="grs-status1"]//text()

which returned the exact same result. That set is our baseline.

Next, I hooked into this CLDR import fragment:

https://github.com/python-babel/babel/blob/33d1dd738af8da30c7f24efe6b76ff8f56d154fc/scripts/import_cldr.py#L375-L381

to collect all the territory codes used in the CLDR:

--- a/scripts/import_cldr.py
+++ b/scripts/import_cldr.py
@@ -293,7 +293,7 @@ def parse_global(srcdir, sup):
             cur_tender = currency.attrib.get('tender', 'true') == 'true'
             # Tie region to currency.
             region_currencies.append((cur_code, cur_start, cur_end, cur_tender))
-            # Keep a reverse index of currencies to territorie.
+            # Keep a reverse index of currencies to territories.
             all_currencies[cur_code].add(region_code)
         region_currencies.sort(key=_currency_sort_key)
         territory_currencies[region_code] = region_currencies
@@ -343,6 +343,7 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
             if group in territory_containment:
                 containers |= territory_containment[group]
             containers.add(group)
+    iso_3166_country_codes = set()

     # prepare the per-locale plural rules definitions
     plural_rules = _extract_plural_rules(os.path.join(srcdir, 'supplemental', 'plurals.xml'))
@@ -376,6 +377,10 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
         elem = tree.find('.//identity/territory')
         if elem is not None:
             territory = elem.attrib['type']
+            try:
+                int(territory)  # Ignore numeric territory codes.
+            except ValueError:
+                iso_3166_country_codes.add(territory)
         else:
             territory = '001'  # world
         regions = territory_containment.get(territory, [])

This results in 248 territory codes:

['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DG', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EA', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GT', 'GU', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'IC', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'XK', 'YE', 'YT', 'ZA', 'ZM', 'ZW']

Now, let cldr_set be the set of territory codes extracted from the CLDR, and let iso_set be the set of country codes extracted from the ISO page (which is the same as the set extracted from Wikipedia). Then:

>>> iso_set.difference(cldr_set)
{'HM', 'AQ', 'GS', 'TF', 'BV'}
>>> cldr_set.difference(iso_set)
{'IC', 'DG', 'EA', 'XK'}
>>> wikipedia_set.difference(iso_set)
set()
>>> iso_set.difference(wikipedia_set)
set()

So maybe this is not such a good idea because the CLDR might not contain a complete list of ISO 3166 country codes, but I’m not familiar enough with it to be certain.

I wonder if the ISO folks would donate their Country Codes for the purpose of building this package? Or I could add some code to scrape the Wikipedia page.

@akx, what do you think?

akx commented 2 years ago

Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(

The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).

That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow: the ones missing from the CLDR set (AQ, BV, GS, HM, TF) are uninhabited territories, while the extra ones (IC, DG, EA, XK) are exceptionally reserved or user-assigned codes.

I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...

jenstroeger commented 2 years ago

Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(

I agree very much 😉

The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).

Oh that’s a useful repo indeed, I didn’t know!

~~What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?~~ I filed issue #45.

That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow:

[…]

I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...

How about this: if we can’t use the data from the Debian repo, then I’ll proceed with the CLDR set alone. We could manually add the missing countries (though I feel a little naughty about manual intervention on generated data).
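
If it came to that, the manual supplement could be as small as the following sketch; the constant and function names are made up, and the five codes are exactly the iso_set - cldr_set difference above:

# Hypothetical sketch of manually topping up the CLDR-derived set.
CLDR_MISSING_COUNTRY_CODES = frozenset({'AQ', 'BV', 'GS', 'HM', 'TF'})

def complete_country_codes(cldr_codes):
    """Return the CLDR-derived codes plus the manually maintained stragglers."""
    return frozenset(cldr_codes) | CLDR_MISSING_COUNTRY_CODES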

akx commented 2 years ago

What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?

I'm not against that, but I'm a bit on the fence about adding another data source to the import process.

Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?

jenstroeger commented 1 year ago

@akx still no response to issue #45. Shall I proceed with:

Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?

HIRANO-Satoshi commented 1 year ago

@jenstroeger ping, ping, ping

Here is a list of territories. Look at the localeDisplayNames/territories element in:

https://raw.githubusercontent.com/unicode-org/cldr/main/common/main/en.xml

A JSON version is here.

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-localenames-full/main/en/territories.json

Keys with numbers, as well as EU, EZ, UN, XA, XB, and UN-alt-short, should be removed.
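
A rough sketch of that filtering; the local file path and the cldr-json layout (main -> en -> localeDisplayNames -> territories) are assumptions based on the repository linked above:

import json

EXCLUDE = {'EU', 'EZ', 'UN', 'XA', 'XB'}  # 'ZZ' (unknown region) may also need excluding

with open('territories.json', encoding='utf-8') as f:
    territories = json.load(f)['main']['en']['localeDisplayNames']['territories']

codes = frozenset(
    key for key in territories
    if len(key) == 2 and key.isalpha()  # drops numeric regions ('001') and '-alt-' variants
    and key not in EXCLUDE
)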

Atem18 commented 1 year ago

The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).

Oh that’s a useful repo indeed, I didn’t know!

Hi, I licensed the project as LGPL because I thought that I had to use the same licence as the original project.

But maybe I do not need to: https://opensource.stackexchange.com/questions/5175/including-untouched-lgpl-library-in-a-mit-licenced-project#:~:text=Yes.%20Your%20only%20LGPL%20requirements%20apply%20to%20the,would%20then%20use%20some%20of%20the%20library%27s%20classes.

So if someone can really confirm, I can change the licence.

nschimmoller commented 10 months ago

@jenstroeger it’s obviously your prerogative to work on this. However, in case it makes your life any easier, there is pycountry, which also sources its ISO 3166 data from Debian’s iso-codes.

>>> import pycountry
>>> len(pycountry.countries)
249
>>> list(pycountry.countries)[0]
Country(alpha_2='AF', alpha_3='AFG', name='Afghanistan', numeric='004', official_name='Islamic Republic of Afghanistan')
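
If that route were taken, getting just the set of two-letter codes would be a one-liner (shown here only as a sketch):

>>> len({c.alpha_2 for c in pycountry.countries})  # the set of ISO 3166-1 alpha-2 codes
249
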
jenstroeger commented 10 months ago

Hot diggity, I had lost track of this issue, my apologies!

I need to make time this week to noodle on this and a couple of other OSS GitHub PRs.