standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.

Additional language checks? #383

Closed vr8hub closed 2 years ago

vr8hub commented 3 years ago

I ran into a couple of things in the review that we may or may not want to check for in lint.

  1. All of the language attributes were just "lang" instead of "xml:lang". This may be too nebulous to check for (the possibilities for bad attributes are endless), but this specific one might be worth it. (I sometimes forget to put the z3998: in front of salutation, valediction, etc., but those are caught because they're not in the dictionary. The "lang" wasn't caught by anything; it also builds without error.)
  2. There were three-character language tags, so maybe lint could check that they're all two-character, either in whole or prior to the dash, in the case of some of the multi-part Chinese tags.

If you're interested in these, I'll look into them this weekend; I don't have time until then.

acabal commented 3 years ago

1 is a simple check, I can knock that out in 5 minutes. Checking for valid ISO language tags is complicated because there are a lot of variations and sometimes three characters are valid. If you want to explore what makes a valid ISO language tag and whether we can validate it, go for it.
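
For illustration, check 1 could be sketched roughly as below. This is a minimal sketch, not the actual lint code: it assumes the XHTML of a file is available as a string, and the names `BARE_LANG` and `find_bare_lang_attributes` are hypothetical. It is a plain regex scan, so it would also match the same pattern appearing in body text.

```python
import re

# Hypothetical pattern: a `lang` attribute preceded by whitespace.
# In `xml:lang="fr"` the character before `lang` is a colon, so it is skipped.
BARE_LANG = re.compile(r'(?<=\s)lang\s*=\s*"([^"]*)"')


def find_bare_lang_attributes(xhtml: str) -> list[str]:
    """Return the value of every bare `lang` attribute found in the markup."""
    return BARE_LANG.findall(xhtml)


sample = '<p><i lang="fr">bonjour</i> and <i xml:lang="la">ave</i></p>'
print(find_bare_lang_attributes(sample))  # ['fr']
```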


robinwhittleton commented 3 years ago

An example of a valid three-letter tag with no equivalent two-letter one that I’ve used is enm for Middle English. Also, und when it’s not clear what the language is meant to be.

acabal commented 3 years ago

I've found this site helpful when dealing with language tags: https://r12a.github.io/app-subtags/

vr8hub commented 3 years ago

Very good, thanks to both of you. I've never encountered a language that didn't have a 639-1 code, so I thought (I don't know why; were two-character codes mentioned at some point in our documentation?) we always used two-character ones. My bad.

The first part of the IETF wiki says "A single primary language subtag based on a two-letter language code from ISO 639-1 (2002) or a three-letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008), or registered through the BCP 47 process and composed of five to eight letters."
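
The shape described in that sentence (two letters, three letters, or five to eight letters before the first dash) can be sketched as a regex check. This is only a shape test under that reading of the quote: it does not confirm that a code is actually registered, and it ignores private-use and grandfathered tags. The names `PRIMARY_SUBTAG` and `has_plausible_primary_subtag` are hypothetical.

```python
import re

# Two or three letters (ISO 639 codes), or five to eight letters
# (subtags registered through the BCP 47 process).
PRIMARY_SUBTAG = re.compile(r"[A-Za-z]{2,3}|[A-Za-z]{5,8}")


def has_plausible_primary_subtag(tag: str) -> bool:
    """Check only the shape of the part of the tag before the first dash."""
    primary = tag.split("-", 1)[0]
    return PRIMARY_SUBTAG.fullmatch(primary) is not None


for tag in ("en", "enm", "zh-Latn", "e", "en1"):
    print(tag, has_plausible_primary_subtag(tag))
```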

From looking at the codes here, there are instances of languages having a 639-2 code but not a 639-1 one, but there are none that don't have a 639-2. I'll get a list of how many this weekend; it may be feasible to just check that, if what's before the dash is not two characters, it has to be one of those codes. I don't want to get into checking all the parts of the dashed ones; 99.9% of ours are plain codes, so baby steps. :)
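
A rough sketch of that "baby steps" check, assuming a precomputed set of three-letter codes that have no two-letter equivalent. The set below is a placeholder holding only the two examples mentioned earlier in this thread (enm and und), not the real list, and the names `THREE_LETTER_ONLY` and `primary_subtag_ok` are hypothetical. Note that it accepts any two ASCII letters without checking them against ISO 639-1.

```python
# Placeholder: in practice this would hold every three-letter code
# that has no ISO 639-1 (two-letter) equivalent.
THREE_LETTER_ONLY = frozenset({"enm", "und"})


def primary_subtag_ok(tag: str) -> bool:
    """Accept a two-letter primary subtag, or one found in the allowlist."""
    primary = tag.split("-", 1)[0].lower()
    if len(primary) == 2 and primary.isascii() and primary.isalpha():
        return True
    return primary in THREE_LETTER_ONLY


for tag in ("en", "zh-Latn", "enm", "eng", "xyz"):
    print(tag, primary_subtag_ok(tag))
```

Everything after the first dash is deliberately ignored here, in line with not checking the other parts of dashed tags.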

vr8hub commented 3 years ago

Ugh. There are 489 languages that have a 639-2 (three-letter) code but no 639-1 code. I'm assuming that's too big a list for us to validate against. So, as Ms. Litella would say…

vr8hub commented 3 years ago

Sigh. Remember when I said I wouldn't have time to look at this until this weekend? Well, I should have waited. Numbers shows the original row number when a filter is applied, not the count, and I hadn't used a filter in it before, so I misread the data.

It's not as bad, but still significant—303 languages.

acabal commented 2 years ago

Hi Vince, any progress on this or should we close it for now?

vr8hub commented 2 years ago

Yeah, sorry, this one is way on the back-burner. You can close it for now.