dscorbett closed 3 years ago
Wow, thanks for the detailed checks.
Can you tell me why "example" should parse?
One of the productions for language is 5*8ALPHA. Such subtags are invalid but well-formed. BCP 47 says:
5. Any language subtags of five to eight characters in length in the
IANA registry were defined via the registration process in
Section 3.5 and MAY be used to form the primary language subtag.
An example of what such a registration might include is the
grandfathered IANA registration "i-enochian". The subtag
'enochian' could be registered in the IANA registry as a primary
language subtag (assuming that ISO 639 does not register this
language first), making tags such as "enochian-AQ" and
"enochian-Latn" valid.
At the time this document was created, there were no examples of
this kind of subtag. Future registrations of this type are
discouraged: an attempt to register any new proposed primary
language MUST be made to the ISO 639 registration authority.
Proposals rejected by the ISO 639 registration authority are
unlikely to meet the criteria for primary language subtags and
are thus unlikely to be registered.
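To make the shape rule concrete, here is a minimal sketch of the primary language alternatives from the RFC 5646 ABNF (2*3ALPHA / 4ALPHA / 5*8ALPHA), with the optional extlang part left out. The helper name `classify_primary_language` is hypothetical and is not part of this library; it only reports which alternative a subtag matches, which is a question of well-formedness, not validity.

```python
import re

# Primary language subtag per the RFC 5646 ABNF, ignoring extlang:
#   2*3ALPHA / 4ALPHA / 5*8ALPHA  ->  two to eight ASCII letters overall
PRIMARY_LANGUAGE = re.compile(r"[A-Za-z]{2,8}")


def classify_primary_language(subtag: str) -> str:
    """Hypothetical helper: name the ABNF alternative a subtag matches."""
    if not PRIMARY_LANGUAGE.fullmatch(subtag):
        return "ill-formed"
    if len(subtag) <= 3:
        return "ISO 639 code shape"
    if len(subtag) == 4:
        return "reserved for future use"
    # Five to eight letters: the shape discussed in the quoted passage.
    return "registered language subtag shape"


print(classify_primary_language("example"))
print(classify_primary_language("en"))
```

Under this sketch, "example" matches the 5*8ALPHA alternative, so it is well-formed even though no such subtag is registered.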
It's not saying that I MUST parse these, right? I think it would cause actual confusion and raise the potential for error if strings that are shaped like no existing language tag, and no plausible future language tag, were parsed as languages.
The "Enochian language" was a weird hoax anyway. I know we have to be able to parse i-enochian for backward compatibility with a standard that it ended up in, and we do, but it's normalized to x-i-enochian and always should be.
I thought it would be more consistent to allow it. und-aaa-bbb-ccc is parsed fine, though it is permanently reserved as invalid, whereas example is merely implausible. Still, as long as the documentation doesn’t say something like “Language.get accepts all well-formed BCP 47 tags”, it’s not wrong to keep it as is.
Thanks! If you don't mind checking whether I did it right in the most recent push to master, I can wait a moment before releasing v3.2.1.
Subtags in extensions are not checked.
>>> tag_is_valid('und-u-')
True
>>> tag_is_valid('und-e-:(')
True
>>> tag_is_valid('und-a-123456789')
True
Numeric singletons are rejected.
>>> tag_is_valid('und-0-foo')
False
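For comparison, this is roughly what the RFC 5646 ABNF says about the extension part on its own: `extension = singleton 1*("-" (2*8alphanum))`, where a singleton is any single ASCII letter or digit except "x". The sketch below is an illustration only, and `extension_is_well_formed` is a hypothetical name, not a function of this library. It takes just the extension portion of a tag (for example "0-foo" from "und-0-foo") and checks well-formedness, not whether the extension is registered.

```python
import re

# extension = singleton 1*("-" (2*8alphanum))
# singleton = one ASCII letter or digit, excluding "x"/"X" (private use)
EXTENSION = re.compile(r"[0-9A-WY-Za-wy-z](?:-[0-9A-Za-z]{2,8})+")


def extension_is_well_formed(extension: str) -> bool:
    """Hypothetical helper: check one extension sequence against the ABNF."""
    return EXTENSION.fullmatch(extension) is not None


for candidate in ["u-", "e-:(", "a-123456789", "0-foo"]:
    print(candidate, extension_is_well_formed(candidate))
```

By this reading, the first three examples above are ill-formed (empty subtag, non-alphanumeric characters, nine-character subtag), while a numeric singleton such as "0" is allowed by the grammar.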
Thanks for keeping track of all these cases. I fixed those (now I check all subtags to make sure they're 1-8 alphanumeric ASCII characters).
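A check like the one described (every subtag is 1-8 alphanumeric ASCII characters) might look like the sketch below. This is an assumption about the approach, not the code that was actually pushed, and `subtags_have_valid_shape` is a hypothetical name. It uses an explicit ASCII character class because `str.isalnum()` also accepts non-ASCII letters and digits.

```python
import re

# One subtag: 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[0-9A-Za-z]{1,8}")


def subtags_have_valid_shape(tag: str) -> bool:
    """Hypothetical helper: True if every hyphen-separated subtag is 1-8 ASCII alphanumerics."""
    # A trailing or doubled hyphen produces an empty subtag, which fails the check.
    return all(SUBTAG.fullmatch(part) for part in tag.split("-"))


for candidate in ["und-u-", "und-e-:(", "und-a-123456789", "und-0-foo"]:
    print(candidate, subtags_have_valid_shape(candidate))
```

This is a necessary condition only: it catches the empty, non-alphanumeric, and over-long subtags reported above, but it does not enforce the ordering or length rules for individual subtag types.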
Here are some ill-formed tags that this library doesn’t throw exceptions for, and one well-formed (though invalid) tag that it does throw an exception for.