Open jenstroeger opened 2 years ago
Sounds like a good idea. If you feel like writing a PR to implement this, feel free to!
Thanks @akx, happy to create a PR if you don’t mind a little guidance:
- Would a list[str] suffice, which then contains the two-letter country codes?
- Would it live in babel.core next to the LOCALE_ALIASES, maybe called COUNTRIES or… ?
- Should the list be generated from the common/main/*.xml files (you’re more familiar with the CLDR data than I am)?
- Sure, just a frozenset of ISO 3166-2s should be fine to begin with, I think? (A list is needlessly ordered.)
…and performs worse when searched. You just beat me to making the change from list[str] to set[str], but frozenset works as well 👍🏼
- It shouldn't be hard-coded (in fact LOCALE_ALIASES probably shouldn't be either), especially since you'd load it from the CLDR. The API would probably be get_iso3166_2_country_codes() or similar, and it would call get_global() to load the data from the pickle files.
Thanks!
- The filenames are locale identifiers, not countries. If there is a definitive list of countries within the CLDR data (I don't have a browsable copy at hand right now), then use that by all means.
I’ve not found a list of country codes, but thought that iterating over the main/*.xml files and pulling out their ldml/identity/language/@type values might work — those should be the country codes.
- Yes, to format the data to be pickled into global.dat, and babel/core.py.
Thanks!
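For concreteness, a minimal sketch of what that accessor could look like in babel/core.py, assuming the import script stores the codes under a hypothetical iso_3166_country_codes key in global.dat:

from babel.core import get_global  # inside babel/core.py itself no import would be needed

def get_iso3166_2_country_codes() -> frozenset:
    # Sketch only: 'iso_3166_country_codes' is a placeholder pickle key;
    # the actual key name would be decided in the PR.
    return frozenset(get_global('iso_3166_country_codes'))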
I think using ldml/identity/territory/@type (from files in main/*.xml) doesn’t work.
For reference, I downloaded ISO_3166-1_alpha-2.html from Wikipedia and extracted the set of “Officially assigned” country codes with this XPath:
//table[@class="wikitable"][2]//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]//span[@class="monospaced"]/text()
which returns 249 elements:
['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BV', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GS', 'GT', 'GU', 'GW', 'GY', 'HK', 'HM', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TF', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
To crosscheck, I saved the page DOM from the iso.org website and extracted country codes from that table:
//table[@class="grs-grid"]//td[@class="grs-status1"]//text()
which returned the exact same result. That set is our baseline.
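Roughly, the two baseline sets above can be reproduced with something like this (a sketch; the saved file names are assumptions, and the XPaths are the ones quoted above):

from lxml import html

def codes_from(path, xpath):
    # Parse the saved HTML page, evaluate the XPath, and drop whitespace-only hits.
    tree = html.parse(path)
    return frozenset(text.strip() for text in tree.xpath(xpath) if text.strip())

wikipedia_set = codes_from(
    "ISO_3166-1_alpha-2.html",  # saved from Wikipedia, as above
    '//table[@class="wikitable"][2]//td[contains(@style, "#9EFF9E") or contains(@style, "#BFE")]'
    '//span[@class="monospaced"]/text()',
)
iso_set = codes_from(
    "iso-country-codes.html",  # saved page DOM from iso.org (file name assumed)
    '//table[@class="grs-grid"]//td[@class="grs-status1"]//text()',
)
assert wikipedia_set == iso_set  # both yield the same 249 codes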
Next, I hooked into this CLDR import fragment to get to all territory codes used in the CLDR:
--- a/scripts/import_cldr.py
+++ b/scripts/import_cldr.py
@@ -293,7 +293,7 @@ def parse_global(srcdir, sup):
cur_tender = currency.attrib.get('tender', 'true') == 'true'
# Tie region to currency.
region_currencies.append((cur_code, cur_start, cur_end, cur_tender))
- # Keep a reverse index of currencies to territorie.
+ # Keep a reverse index of currencies to territories.
all_currencies[cur_code].add(region_code)
region_currencies.sort(key=_currency_sort_key)
territory_currencies[region_code] = region_currencies
@@ -343,6 +343,7 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
if group in territory_containment:
containers |= territory_containment[group]
containers.add(group)
+ iso_3166_country_codes = set()
# prepare the per-locale plural rules definitions
plural_rules = _extract_plural_rules(os.path.join(srcdir, 'supplemental', 'plurals.xml'))
@@ -376,6 +377,10 @@ def _process_local_datas(sup, srcdir, destdir, force=False, dump_json=False):
elem = tree.find('.//identity/territory')
if elem is not None:
territory = elem.attrib['type']
+ try:
+ int(territory) # Ignore numeric territory codes.
+ except ValueError:
+ iso_3166_country_codes.add(territory)
else:
territory = '001' # world
regions = territory_containment.get(territory, [])
This results in 248 territory codes:
['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AR', 'AS', 'AT', 'AU', 'AW', 'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DG', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EA', 'EC', 'EE', 'EG', 'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GT', 'GU', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'IC', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TG', 'TH', 'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'XK', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
Now, let cldr_set be the set of territory codes extracted from the CLDR, and let iso_set be the set of country codes extracted from the ISO page (which is the same as the set extracted from Wikipedia). Then:
>>> iso_set.difference(cldr_set)
{'HM', 'AQ', 'GS', 'TF', 'BV'}
>>> cldr_set.difference(iso_set)
{'IC', 'DG', 'EA', 'XK'}
>>> wikipedia_set.difference(iso_set)
set()
>>> iso_set.difference(wikipedia_set)
set()
So maybe this is not such a good idea because the CLDR might not contain a complete list of ISO 3166 country codes, but I’m not familiar enough with it to be certain.
I wonder if the ISO folks would donate their Country Codes for the purpose of building this package? Or I’d add some code to scrape the Wikipedia page.
@akx, what do you think?
Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(
The isocodes
project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow:
I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...
Scraping the Wikipedia page during the import process sounds like a bad idea, and dealing with ISO to license the country codes for use in Babel also sounds problematic. :(
I agree very much 😉
The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
Oh that’s a useful repo indeed, I didn’t know!
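For reference, pulling the alpha-2 codes out of that JSON would look roughly like this — a sketch, assuming the file’s top-level key is "3166-1" and each entry carries an "alpha_2" field:

import json

# File from https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json
with open("iso_3166-1.json", encoding="utf-8") as fh:
    data = json.load(fh)

iso_codes = frozenset(entry["alpha_2"] for entry in data["3166-1"])
print(len(iso_codes))  # expected: 249, if my reading of the file format is right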
~What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?~ I filed issue #45.
That said, we're no strangers to the CLDR containing partial data, and the codes that show up in those difference lists are all "special" somehow:
[…]
I don't think it's that big of a problem if those do or don't necessarily appear in the CLDR-derived data...
How about this: if we can’t use the data from the Debian repo, then I’ll proceed with the CLDR set alone. We could manually add the missing countries (though I feel a little naughty about manual intervention on generated data).
What I can do is file an issue with a license request, and just ask if we could use parts of their data to generate this package. What says you?
I'm not against that, but I'm a bit on the fence about adding another data source to the import process.
Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?
@akx still no response to issue #45. Shall I proceed with:
Maybe we should start with the CLDR data, with a note that "some territories may be missing if they are not present in the CLDR data"...?
@jenstroeger ping, ping, ping
Here is a list of territories. Look at
https://raw.githubusercontent.com/unicode-org/cldr/main/common/main/en.xml
A JSON version is here.
Keys with numbers, as well as UE, EZ, UN, XA, XB, and UN-alt-short, should be removed.
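A sketch of that filtering, assuming a locally downloaded copy of en.xml; the exclusion set simply mirrors the keys listed above:

import xml.etree.ElementTree as ET

# Keys from the list above that should not be treated as country codes.
EXCLUDE = {"UE", "EZ", "UN", "XA", "XB"}

def cldr_territory_codes(path="en.xml"):
    root = ET.parse(path).getroot()  # root element is <ldml>
    codes = set()
    for territory in root.findall("./localeDisplayNames/territories/territory"):
        code = territory.attrib["type"]
        if territory.get("alt"):          # skips alt variants such as UN-alt-short
            continue
        if code.isdigit() or code in EXCLUDE:
            continue
        codes.add(code)
    return frozenset(codes)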
The isocodes project uses https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/data/iso_3166-1.json as a source. The license there is LGPL (IANAL and all, it's hard to say whether the static data (or derivatives thereof) is/could be licensed under the LGPL).
Oh that’s a useful repo indeed, I didn’t know!
Hi, I made the project LGPL because I thought that I had to use the same licence as the original project.
But maybe I don’t need to: https://opensource.stackexchange.com/questions/5175/including-untouched-lgpl-library-in-a-mit-licenced-project#:~:text=Yes.%20Your%20only%20LGPL%20requirements%20apply%20to%20the,would%20then%20use%20some%20of%20the%20library%27s%20classes.
So if someone can really confirm that, I can change the licence.
@jenstroeger it’s obviously your prerogative to work on this. However, if it makes your life any easier, there is pycountry, which also sources its ISO 3166 data from Debian’s iso-codes.
>>> import pycountry
>>> len(pycountry.countries)
249
>>> list(pycountry.countries)[0]
Country(alpha_2='AF', alpha_3='AFG', name='Afghanistan', numeric='004', official_name='Islamic Republic of Afghanistan')
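For what it's worth, if pycountry were acceptable as a data source, the frozenset discussed earlier would be a one-liner:
>>> frozenset(country.alpha_2 for country in pycountry.countries)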
Hot diggity, I had lost track of this issue, my apologies!
I need to make time this week to noodle on this and a couple of other OSS GitHub PRs.
More of a feature question/suggestion: would it be possible to generate and provide a list of ISO 3166-2 country codes? A quick glance at common/main/*.xml (cldr-common.zip) would indicate that the 2-letter codes are provided as part of the file names. I’m just not familiar enough with the standard, but Country/Region (Territory) Names mentions that they’re related. Or am I misinterpreting it?
The Babel Core already has a few hardwired country codes for language aliasing:
https://github.com/python-babel/babel/blob/33d1dd738af8da30c7f24efe6b76ff8f56d154fc/babel/core.py#L79-L88
Basically, what I’m suggesting is something like python-iso3166 but generated from the CLDR data, and perhaps as simple strings.
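For illustration, the end result could then be used roughly like this (the accessor name is hypothetical and matches the sketch earlier in the thread):
>>> from babel.core import get_iso3166_2_country_codes  # hypothetical accessor, does not exist yet
>>> 'DE' in get_iso3166_2_country_codes()
True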