Open kpu opened 2 years ago
Sorry for the delayed response.
If I understand correctly, the problem here is that en and en-GB are not the same (i.e., not an exact match), but they are compatible. We should use a compatibility match instead of an exact match.
The internal parser already converts all language IDs to BCP47 objects while parsing TMX, https://github.com/thammegowda/mtdata/blob/ffa6c6736469fb428cd92e98b02d9f6fd45aa006/mtdata/tmx.py#L32
but the issue is that we then do an exact match on them, https://github.com/thammegowda/mtdata/blob/ffa6c6736469fb428cd92e98b02d9f6fd45aa006/mtdata/tmx.py#L65
Solution: we need to switch to a compatibility check (instead of an exact match), which is already implemented in BCP47Tag.are_compatible
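To illustrate the difference between the two kinds of match, here is a rough sketch on plain tag strings. It is not mtdata's code: `split_tag` and `are_compatible_sketch` are hypothetical helpers, and the rule used here (one tag's subtags are a prefix of the other's) is only an assumption about what "compatible" could mean; the real BCP47Tag.are_compatible may behave differently.

```python
from typing import List


def split_tag(tag: str) -> List[str]:
    """Hypothetical helper: normalize a tag and split it into subtags.

    Treats '_' and '-' as the same separator and ignores case,
    so 'de_DE' and 'de-DE' both become ['de', 'de'].
    """
    return tag.replace("_", "-").lower().split("-")


def are_compatible_sketch(a: str, b: str) -> bool:
    """Assumed rule: tags are compatible if one is a prefix of the other.

    'en' vs 'en-GB' is compatible (the shorter tag is less specific),
    while 'en-GB' vs 'en-US' is not (they disagree on the region).
    """
    pa, pb = split_tag(a), split_tag(b)
    n = min(len(pa), len(pb))
    return pa[:n] == pb[:n]


# Exact match rejects the pair from this issue ...
print("en" == "en-GB")                         # False
# ... while the prefix-based check accepts it.
print(are_compatible_sketch("en", "en-GB"))    # True
print(are_compatible_sketch("en-GB", "en-US")) # False
```

With a check of this kind, an index entry of de and en would accept TMX segments tagged de-DE and en-GB, and the reverse.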
I have improved language matching in https://github.com/thammegowda/mtdata/commit/c2e024d916f69ff95045900a32f0810803a15b53
Files like this https://elrc-share.eu/repository/download/eb7d99f809f611e9b7d400155d02670650ef3e7a68e54dbe8e60197297332488/ have de-DE and en-GB language codes in the TMX. If I'm sloppy and just provide de and en in the index then I get
because the TMX parser does a strict match.
Actually it's worse than that: the file announces en-GB:
Then all the segments have en:
Coding it as en and de_DE in the index seems to work, but that seems like overkill.