Closed vr8hub closed 2 years ago
1 is a simple check, I can knock that out in 5 minutes. Checking for valid ISO language tags is complicated because there are a lot of variations and sometimes three characters are valid. If you want to explore what makes a valid ISO language tag and whether we can validate it, go for it.
On 12/17/20 3:13 PM, vr8hub wrote:
I ran into a couple of things in the review that we may or may not want to check for in lint.
- All of the language tags were just "lang" instead of "xml:lang". This may be too nebulous to check for (the possibilities for bad tags are endless), but this specific one might be worth it. (I sometimes forget to put the z3998: in front of salutation, valediction, etc., but those are caught because they're not in the dictionary. The "lang" wasn't caught by anything; it also builds without error.)
- There were three-character language tags, so maybe lint could check that they're all two-character, either in whole or prior to the dash, in the case of some of the multi-part Chinese tags.
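The first item could be sketched roughly as follows. This is only an illustration, not how lint is actually implemented: the function name `find_bare_lang_attributes` is hypothetical, and it assumes a bare `lang` attribute is never intended in our source files. It relies on the standard library parser expanding `xml:lang` to a namespaced attribute name, so a plain `lang` key can only come from an unprefixed attribute.

```python
import xml.etree.ElementTree as ET


def find_bare_lang_attributes(xhtml: str) -> list:
    """Return (element name, value) pairs for attributes written as lang instead of xml:lang.

    Hypothetical helper for illustration; assumes the input is well-formed XML.
    """
    root = ET.fromstring(xhtml)
    findings = []
    for element in root.iter():
        # xml:lang is stored as "{http://www.w3.org/XML/1998/namespace}lang",
        # so a plain "lang" key means the xml: prefix was left off.
        if "lang" in element.attrib:
            local_name = element.tag.rsplit("}", 1)[-1]
            findings.append((local_name, element.attrib["lang"]))
    return findings


sample = (
    '<p xmlns="http://www.w3.org/1999/xhtml">'
    'A <i lang="fr">mot</i> and <i xml:lang="de">Wort</i>.</p>'
)
print(find_bare_lang_attributes(sample))
```

Only the first `<i>` is reported here, since the second one carries the `xml:` prefix.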
If you're interested in these, I'll look into them this weekend, I don't have time until then.
An example of a valid three-letter tag with no equivalent two-letter one that I’ve used is `enm` for Middle English. Also, `und` when it’s not clear what the language is meant to be.
I've found this site helpful when dealing with language tags: https://r12a.github.io/app-subtags/
Very good, thanks to both of you. I've never encountered a language that didn't have a 639-1 code, so I thought (I don't know why; were two-character codes mentioned at some point in our documentation?) we always used two-character ones. My bad.
The first part of the IETF wiki says "A single primary language subtag based on a two-letter language code from ISO 639-1 (2002) or a three-letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008), or registered through the BCP 47 process and composed of five to eight letters."
From looking at the codes here, there are instances of languages having a 639-2 code but not a 639-1 one, but there are none that don't have a 639-2. I'll get a list of how many this weekend; it may be feasible to just check that, if what's before the dash is not two characters, it has to be one of those codes. I don't want to get into checking all the parts of the dashed ones; 99.9% of ours are plain codes, so baby steps. :)
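A minimal sketch of that idea, under stated assumptions: the function name `primary_subtag_is_plausible` is hypothetical, and `THREE_LETTER_CODES` is a tiny illustrative subset (a real check would need the full list of 639-2 codes that lack a 639-1 code). Two-letter primary subtags are only checked for shape, not against the 639-1 list, and everything after the first dash is ignored.

```python
# Illustrative subset only (an assumption for this sketch); the real list
# would be every ISO 639-2 code that has no ISO 639-1 equivalent.
THREE_LETTER_CODES = {"enm", "und", "ang", "grc"}


def primary_subtag_is_plausible(tag: str) -> bool:
    """Check only the part of a language tag before the first dash."""
    primary = tag.split("-", 1)[0].lower()
    if len(primary) == 2:
        # Shape check only: two ASCII letters, not validated against ISO 639-1.
        return primary.isascii() and primary.isalpha()
    # Anything that is not two characters must be a known three-letter code.
    return primary in THREE_LETTER_CODES


for tag in ["fr", "zh-Hant", "enm", "xyz", "english"]:
    print(tag, primary_subtag_is_plausible(tag))
```

With this subset, `fr`, `zh-Hant`, and `enm` pass, while `xyz` and `english` are flagged.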
Ugh. There are 489 languages that have a 639-2 (three-letter) code but no 639-1 code. I'm assuming that's too big a list for us to validate against. So, as Ms. Litella would say…
Sigh. Remember when I said I wouldn't have time to look at this til this weekend? Well, I should have waited. Numbers shows the original row number on a filter, not the count, and I haven't used filter on it before, so I misread the data.
It's not as bad, but still significant—303 languages.
Hi Vince, any progress on this or should we close it for now?
Yeah, sorry, this one is way on the back-burner. You can close it for now.