rspeer / langcodes

A Python library for working with and comparing language codes.
MIT License

More parsing problems #50

Closed dscorbett closed 3 years ago

dscorbett commented 3 years ago

Here are some ill-formed tags that this library doesn’t throw exceptions for, and one well-formed (though invalid) tag that it does throw an exception for.

>>> Language.get('x-')
Language.make(language='x-')
>>> Language.get('x-123456789')
Language.make(language='x-123456789')
>>> Language.get('x-')
Language.make(language='x-\ue83f\ue857\ue852\ue83f')
>>> Language.get('und-u-')
Language.make(extensions=['u-'])
>>> Language.get('und-?-foo')
Language.make(extensions=['?-foo'])
>>> Language.get('ar-٠٠١')
Language.make(language='ar', territory='٠٠١')
>>> Language.get('zh-普通话')
Language.make(language='zh', extlangs=['普通话'])
>>> Language.get('non-ᚱᚢᚾᛟ')
Language.make(language='non', script='ᚱᚢᚾᛟ')
>>> Language.get('fr-1606thré')
Language.make(language='fr', variants=['1606thré'])
>>> Language.get('example')
langcodes.tag_parser.LanguageTagError: Expected a language code, got 'example'

rspeer commented 3 years ago

Wow, thanks for the detailed checks.

Can you tell me why "example" should parse?

dscorbett commented 3 years ago

One of the productions for language is 5*8ALPHA. Such subtags are invalid but well-formed. BCP 47 says:

   5.  Any language subtags of five to eight characters in length in the
       IANA registry were defined via the registration process in
       Section 3.5 and MAY be used to form the primary language subtag.
       An example of what such a registration might include is the
       grandfathered IANA registration "i-enochian".  The subtag
       'enochian' could be registered in the IANA registry as a primary
       language subtag (assuming that ISO 639 does not register this
       language first), making tags such as "enochian-AQ" and "enochian-
       Latn" valid.

       At the time this document was created, there were no examples of
       this kind of subtag.  Future registrations of this type are
       discouraged: an attempt to register any new proposed primary
       language MUST be made to the ISO 639 registration authority.
       Proposals rejected by the ISO 639 registration authority are
       unlikely to meet the criteria for primary language subtags and
       are thus unlikely to be registered.
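
For illustration, the three shapes that the RFC 5646 grammar allows for a primary language subtag can be sketched as below. This is a minimal sketch of the ABNF only, not langcodes' parser: the function name `primary_language_shape` is hypothetical, and extlang suffixes are left out.

```python
import re

# Shapes of the primary language subtag in the RFC 5646 grammar.
# The labels follow the comments in the RFC's ABNF.
_PRIMARY_LANGUAGE_SHAPES = [
    ("shortest ISO 639 code", re.compile(r"[A-Za-z]{2,3}")),
    ("reserved for future use", re.compile(r"[A-Za-z]{4}")),
    ("registered language subtag", re.compile(r"[A-Za-z]{5,8}")),
]


def primary_language_shape(subtag):
    """Return the name of the production the subtag matches, or None.

    Hypothetical helper: it checks well-formedness only, not whether
    the subtag is actually in the IANA registry.
    """
    for name, pattern in _PRIMARY_LANGUAGE_SHAPES:
        if pattern.fullmatch(subtag):
            return name
    return None


print(primary_language_shape("example"))
print(primary_language_shape("en"))
print(primary_language_shape("x-"))
```

Under this reading, 'example' matches the 5*8ALPHA production, which is why it counts as well-formed even though no such subtag is registered.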

rspeer commented 3 years ago

It's not saying that I MUST parse these, right? I think it would cause actual confusion and raise the potential for error if strings that are shaped like no existing language tag, and no plausible future language tag, were parsed as languages.

The "Enochian language" was a weird hoax anyway. I know we have to be able to parse i-enochian for backward compatibility with a standard that it ended up in, and we do, but it's normalized to x-i-enochian and always should be.

dscorbett commented 3 years ago

I thought it would be more consistent to allow it. und-aaa-bbb-ccc is parsed fine, though it is permanently reserved as invalid, whereas example is merely implausible. Still, as long as the documentation doesn’t say something like “Language.get accepts all well-formed BCP 47 tags”, it’s not wrong to keep it as is.

rspeer commented 3 years ago

Thanks! If you don't mind checking whether I did it right in the most recent push to master, I can wait a moment before releasing v3.2.1.

dscorbett commented 3 years ago

Subtags in extensions are not checked.

>>> tag_is_valid('und-u-')
True
>>> tag_is_valid('und-e-:(')
True
>>> tag_is_valid('und-a-123456789')
True

Numeric singletons are rejected.

>>> tag_is_valid('und-0-foo')
False
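
For reference, here is a rough sketch of what the RFC 5646 grammar expects from an extension: a singleton (any ASCII letter or digit except 'x'), followed by at least one subtag of 2 to 8 ASCII alphanumerics. The helper `extension_is_well_formed` is a hypothetical name and not part of langcodes.

```python
import re

# singleton = DIGIT / A-W / Y-Z / a-w / y-z  (any alphanumeric except x)
_SINGLETON = re.compile(r"[0-9A-WY-Za-wy-z]")
# Each extension subtag is 2 to 8 ASCII letters or digits.
_EXTENSION_SUBTAG = re.compile(r"[0-9A-Za-z]{2,8}")


def extension_is_well_formed(extension):
    """Check one extension such as 'u-co-phonebk' against the grammar.

    Hypothetical helper: it checks the shape only and knows nothing
    about which keys a given extension actually defines.
    """
    singleton, *subtags = extension.split("-")
    return (
        _SINGLETON.fullmatch(singleton) is not None
        and len(subtags) >= 1
        and all(_EXTENSION_SUBTAG.fullmatch(s) for s in subtags)
    )


for ext in ["u-", "e-:(", "a-123456789", "0-foo", "u-co-phonebk"]:
    print(ext, extension_is_well_formed(ext))
```

With this sketch the first three extensions above are rejected and '0-foo' is accepted, since a digit is a legal singleton.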

rspeer commented 3 years ago

Thanks for keeping track of all these cases. I fixed those (now I check all subtags to make sure they're 1-8 alphanumeric ASCII characters).
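
A check along those lines might look roughly like the following. This is a sketch of the idea only, not the code that was pushed to langcodes, and `all_subtags_plausible` is a hypothetical name.

```python
import re

# 1 to 8 ASCII letters or digits. An explicit character class is used
# because str.isalnum() also accepts non-ASCII letters and digits.
_SUBTAG = re.compile(r"[0-9A-Za-z]{1,8}")


def all_subtags_plausible(tag):
    """Return True if every hyphen-separated subtag has a legal shape.

    Hypothetical helper: this is a necessary condition for a
    well-formed tag, not a full BCP 47 parse.
    """
    return all(_SUBTAG.fullmatch(subtag) for subtag in tag.split("-"))


for tag in ["x-", "x-123456789", "ar-٠٠١", "zh-普通话", "und-0-foo", "en-US"]:
    print(tag, all_subtags_plausible(tag))
```

A trailing hyphen yields an empty subtag, which fails the check, and the non-ASCII examples fail because the character class is restricted to ASCII.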