kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

lang tags: using BCP47 instead of ISO639-1 codes #113

Open eroux opened 1 year ago

eroux commented 1 year ago

Hello, and first of all thank you very much for your work on hOCR! I'm part of an organization that gets hOCR from Google Books, and I'm quite new to the specification. Something that caught my eye is the reference to ISO639-1 for language codes. Since ISO639-1 doesn't cover all languages, I think referring to BCP47 would be more generic and future-proof. What do you think? It would be a backward-compatible change, since ISO639-1 codes are valid BCP47 tags (at least to a first approximation).
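
To illustrate the compatibility claim, here is a minimal Python sketch of a well-formedness check for a simplified subset of the BCP47 syntax. The pattern and the helper name `is_wellformed_langtag` are assumptions made for this example: extensions, private-use and grandfathered tags are left out, and nothing is checked against the IANA subtag registry.

```python
import re

# Simplified BCP47 shape: language[-script][-region][-variant...].
# Assumption: extensions, private-use and grandfathered tags are omitted,
# and subtags are not looked up in the IANA registry.
LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                                   # primary language subtag
    r"(?:-[A-Za-z]{4})?"                               # optional script, e.g. Tibt
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                  # optional region, e.g. TW
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"  # optional variants
)


def is_wellformed_langtag(tag: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return LANGTAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["en", "bo", "grc", "bo-Tibt", "zh-Hant-TW", "en_US"]:
        print(tag, is_wellformed_langtag(tag))
```

Two-letter ISO639-1 codes such as `en` and `bo` pass unchanged, which is why the change would not break existing documents. Tags such as `grc` (Ancient Greek, which has no ISO639-1 code) or `bo-Tibt` (language plus script) are what the wider syntax adds.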

kba commented 1 year ago

I don't feel strongly either way, but it might be a good opportunity to align with how ALTO and PAGE handle language/script.

In ALTO we decided on using what xsd:language expects, i.e. RFC 1766, which in turn references ISO639-1. IIUC, this might not be expressive enough for your purposes?

eroux commented 1 year ago

Thanks for your answer!

My understanding is that the latest XSD spec requires BCP47 language tags; the 1.0 spec does indeed refer to RFC1766. I can't think of any reason to recommend RFC1766 over BCP47, but perhaps there are some?
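
As a rough check of how this interacts with `xsd:language`, the Python sketch below applies the lexical pattern that, as far as I recall, the XSD datatypes spec gives for that type. Treat the pattern as an assumption to verify against the spec text; the helper name `fits_xsd_language` is made up for this example.

```python
import re

# Assumption: lexical pattern for xsd:language as recalled from the
# XSD datatypes spec; verify against the spec before relying on it.
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def fits_xsd_language(tag: str) -> bool:
    """Return True if the whole string matches the recalled pattern."""
    return XSD_LANGUAGE.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["en", "bo-Tibt", "zh-Hant-TW", "de-1996", "en_US"]:
        print(tag, fits_xsd_language(tag))
```

If the pattern is right, it is purely syntactic: it accepts BCP47-style tags with script or region subtags just as it accepts plain two-letter codes, so tools validating against `xsd:language` should not need to change.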