Open jenstroeger opened 2 years ago
Sounds like a good idea. If you feel like writing a PR to implement this, feel free to!
Thanks @akx, happy to create a PR if you don’t mind a little guidance:
- Would a list[str] suffice, which then contains the two-letter country codes?
- Would it live in babel.core next to the LOCALE_ALIASES, maybe called COUNTRIES or… ?
- Should the list be generated from the common/main/*.xml files (you’re more familiar with the CLDR data than I am)?
- Sure, just a frozenset of ISO 3166-2s should be fine to begin with, I think? (A list is needlessly ordered.)
…and performs worse when searched. You just beat me to making the change from list[str] to set[str], but frozenset works as well 👍🏼
- It shouldn't be hard-coded (in fact LOCALE_ALIASES probably shouldn't be either), especially since you'd load it from the CLDR. The API would probably be get_iso3166_2_country_codes() or similar, and it would call get_global() to load the data from the pickle files.
Thanks!
- The filenames are locale identifiers, not countries. If there is a definitive list of countries within the CLDR data (I don't have a browsable copy at hand right now), then use that by all means.
I’ve not found a list of country codes, but thought that iterating over the main/*.xml files and pulling out their ldml/identity/language/@type values might work — those should be the country codes.
- Yes, to format the data to be pickled into global.dat, and babel/core.py.
Thanks!
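For concreteness, a minimal sketch of what that accessor could look like in babel/core.py, assuming the import script stores the codes under a hypothetical iso_3166_country_codes key in global.dat:

from babel.core import get_global  # inside babel/core.py itself no import would be needed

def get_iso3166_2_country_codes() -> frozenset:
    # Sketch only: 'iso_3166_country_codes' is a placeholder pickle key;
    # the actual key name would be decided in the PR.
    return frozenset(get_global('iso_3166_country_codes'))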
I think using ldml/identity/territory/@type (from files in main/*.xml) doesn’t work.
For reference, I downloaded ISO_3166-1_alpha-2.html from Wikipedia and extracted the set of “Officially assigned” country codes with this XPath:
//table[@class="wikitable"][2]//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]//span[@class="monospaced"]/text()
which returns 249 elements:
['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BV', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GS', 'GT', 'GU', 'GW', 'GY', 'HK', 'HM', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TF', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
To crosscheck, I saved the page DOM from the iso.org website and extracted country codes from that table:
//table[@class="grs-grid"]//td[@class="grs-status1"]//text()
which returned the exact same result. That set is our baseline.
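Roughly, the two baseline sets above can be reproduced with something like this (a sketch; the saved file names are assumptions, and the XPaths are the ones quoted above):

from lxml import html

def codes_from(path, xpath):
    # Parse the saved HTML page, evaluate the XPath, and drop whitespace-only hits.
    tree = html.parse(path)
    return frozenset(text.strip() for text in tree.xpath(xpath) if text.strip())

wikipedia_set = codes_from(
    "ISO_3166-1_alpha-2.html",  # saved from Wikipedia, as above
    '//table[@class="wikitable"][2]//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]'
    '//span[@class="monospaced"]/text()',
)
iso_set = codes_from(
    "iso-country-codes.html",  # saved page DOM from iso.org (file name assumed)
    '//table[@class="grs-grid"]//td[@class="grs-status1"]//text()',
)
assert wikipedia_set == iso_set  # both yield the same 249 codes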
Next, I hooked into this CLDR import fragment to get to all territory codes used in the CLDR:
--- a/scripts/import_cldr.py
+++ b/scripts/import_cldr.py
@@ -293,7 +293,7 @@ def parse_global(srcdir, sup):
cur_tender = currency.attrib.get('tender', 'true') == 'true'
# Tie region to currency.
region_currencies.append((cur_code, cur_start, cur_end, cur_tender))
- # Keep a reverse index of currencies to territorie.
+ # Keep a reverse index of currencies to territories.
all_currencies[cur_code].add(region_code)
region_currencies.sort(key=_currency_sort_key)
territory_currencies[region_code] = region_currencies
@@ -343,6 +343,7 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
if group in territory_containment:
containers |= territory_containment[group]
containers.add(group)
+ iso_3166_country_codes = set()
# prepare the per-locale plural rules definitions
plural_rules = _extract_plural_rules(os.path.join(srcdir, 'supplemental', 'plurals.xml'))
@@ -376,6 +377,10 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
elem = tree.find('.//identity/territory')
if elem is not None:
territory = elem.attrib['type']
+ try:
+ int(territory) # Ignore numeric territory codes.
+ except ValueError:
+ iso_3166_country_codes.add(territory)
else:
territory = '001' # world
regions = territory_containment.get(territory, [])
This results in 248 territory codes:
['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DG', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EA', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GT', 'GU', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'IC', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'XK', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
Now, let cldr_set be the set of territory codes extracted from the CLDR, and let iso_set be the set of country codes extracted from the ISO page (which is the same as the set extracted from Wikipedia). Then:
>>> iso_set.difference(cldr_set)
{'HM', 'AQ', 'GS', 'TF', 'BV'}
>>> cldr_set.difference(iso_set)
{'IC', 'DG', 'EA', 'XK'}
>>> wikipedia_set.difference(iso_set)
set()
>>> iso_set.difference(wikipedia_set)
set()
So maybe this is not such a good idea because the CLDR might not contain a complete list of ISO 3166 country codes, but I’m not familiar enough with it to be certain.
I wonder if the ISO folks would donate their Country Codes for the purpose of building this package? Or I’d add some code to scrape the Wikipedia page.
@akx, what do you think?
Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(
The isocodes
project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow:
I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...
Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(
I agree very much 😉
The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
Oh that’s a useful repo indeed, I didn’t know!
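For reference, pulling the alpha-2 codes out of that JSON would look roughly like this — a sketch, assuming the file’s top-level key is "3166-1" and each entry carries an "alpha_2" field:

import json

# File from https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json
with open("iso_3166-1.json", encoding="utf-8") as fh:
    data = json.load(fh)

iso_codes = frozenset(entry["alpha_2"] for entry in data["3166-1"])
print(len(iso_codes))  # expected: 249, if my reading of the file format is right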
~What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?~ I filed issue #45.
That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow:
[…]
I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...
How about this: if we can’t use the data from the Debian repo, then I’ll proceed with the CLDR set alone. We could manually add the missing countries (though I feel a little naughty about manual intervention on generated data).
What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?
I'm not against that, but I'm a bit on the fence about adding another data source to the import process.
Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?
@akx still no response to issue #45. Shall I proceed with:
Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?
@jenstroeger ping, ping, ping
Here is a list of territories. Look at
https://raw.githubusercontent.com/unicode-org/cldr/main/common/main/en.xml
A JSON version is here.
Keys with numbers, as well as UE, EZ, UN, XA, XB, and UN-alt-short, should be removed.
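A sketch of that filtering, assuming a locally downloaded copy of en.xml; the exclusion set simply mirrors the keys listed above:

import xml.etree.ElementTree as ET

# Keys from the list above that should not be treated as country codes.
EXCLUDE = {"UE", "EZ", "UN", "XA", "XB"}

def cldr_territory_codes(path="en.xml"):
    root = ET.parse(path).getroot()  # root element is <ldml>
    codes = set()
    for territory in root.findall("./localeDisplayNames/territories/territory"):
        code = territory.attrib["type"]
        if territory.get("alt"):          # skips alt variants such as UN-alt-short
            continue
        if code.isdigit() or code in EXCLUDE:
            continue
        codes.add(code)
    return frozenset(codes)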
The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
Oh that’s a useful repo indeed, I didn’t know!
Hi, I made the project LGPL because I thought that I had to use the same licence as the original project.
But maybe I don’t need to: https://opensource.stackexchange.com/questions/5175/including-untouched-lgpl-library-in-a-mit-licenced-project#:~:text=Yes.%20Your%20only%20LGPL%20requirements%20apply%20to%20the,would%20then%20use%20some%20of%20the%20library%27s%20classes.
So if someone can really confirm that, I can change the licence.
@jenstroeger it’s obviously your prerogative to work on this. However, if it makes your life any easier, there is pycountry, which also sources its ISO 3166 data from Debian’s iso-codes.
>>> import pycountry
>>> len(pycountry.countries)
249
>>> list(pycountry.countries)[0]
Country(alpha_2='AF', alpha_3='AFG', name='Afghanistan', numeric='004', official_name='Islamic Republic of Afghanistan')
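For what it's worth, if pycountry were acceptable as a data source, the frozenset discussed earlier would be a one-liner:
>>> frozenset(country.alpha_2 for country in pycountry.countries)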
Hot diggity, I had lost track of this issue, my apologies!
I need to make time this week to noodle on this and a couple of other OSS GitHub PRs.
More of a feature question/suggestion: would it be possible to generate and provide a list of ISO 3166-2 country codes? A quick glance at common/main/*.xml (cldr-common.zip) would indicate that the 2-letter codes are provided as part of the file names. I’m just not familiar enough with the standard, but Country/Region (Territory) Names mentions that they’re related. Or am I misinterpreting it?
The Babel Core already has a few hardwired country codes for language aliasing:
https://github.com/python-babel/babel/blob/33d1dd738af8da30c7f24efe6b76ff8f56d154fc/babel/core.py#L79-L88
Basically, what I’m suggesting is something like python-iso3166 but generated from the CLDR data, and perhaps as simple strings.
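For illustration, the end result could then be used roughly like this (the accessor name is hypothetical and matches the sketch earlier in the thread):
>>> from babel.core import get_iso3166_2_country_codes  # hypothetical accessor, does not exist yet
>>> 'DE' in get_iso3166_2_country_codes()
True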