internetarchive / archive-hocr-tools

Efficient hOCR tooling

Feature: Add ISO 639 part2b support for normalize_language #11

Open scottbarnes opened 1 month ago

scottbarnes commented 1 month ago

This commit adds support for converting ISO 639 Part 2B language codes to their two-character equivalents, e.g. fre for French (the Part 2B code) rather than fra (the Part 3 code).

IA items will often include fre, ger, etc., in the metadata language field (see, e.g. https://archive.org/metadata/101610331.nlm.nih.gov/metadata/language).

But this was being passed through as the literal string fre rather than being converted to fr. DAISY and EPUB readers don't recognize fre as a valid language, and instead display the literal string.
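
To illustrate the intended behavior, here is a minimal sketch of the lookup, not the actual code in this PR. The function name `normalize_language_sketch` and both table names are hypothetical, and the Part 3 table is only a small sample. The Part 2B table lists the 20 bibliographic codes that differ from their terminology counterparts.

```python
# Hypothetical sketch; names and structure are assumptions, not the
# archive-hocr-tools implementation.

# ISO 639-2/B (bibliographic) codes that differ from the 639-2/T and 639-3 codes.
PART2B_TO_PART1 = {
    "alb": "sq", "arm": "hy", "baq": "eu", "bur": "my", "chi": "zh",
    "cze": "cs", "dut": "nl", "fre": "fr", "geo": "ka", "ger": "de",
    "gre": "el", "ice": "is", "mac": "mk", "mao": "mi", "may": "ms",
    "per": "fa", "rum": "ro", "slo": "sk", "tib": "bo", "wel": "cy",
}

# Small illustrative sample of Part 3 codes; a real table would be much larger.
PART3_TO_PART1 = {"fra": "fr", "deu": "de", "eng": "en"}


def normalize_language_sketch(code: str) -> str:
    """Return a two-character code when one is known, else the input unchanged."""
    key = code.strip().lower()
    if key in PART2B_TO_PART1:
        return PART2B_TO_PART1[key]
    if key in PART3_TO_PART1:
        return PART3_TO_PART1[key]
    # Unknown values pass through untouched, as in the behavior described above.
    return code


if __name__ == "__main__":
    for value in ("fre", "fra", "ger", "xyz"):
        print(value, "->", normalize_language_sketch(value))
```

Checking the Part 2B table first is an arbitrary choice here, since the two sets of three-letter codes in this sketch do not overlap.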