thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Policy on BCP-47 in TMX files? #98

Open kpu opened 2 years ago

kpu commented 2 years ago

Files like this https://elrc-share.eu/repository/download/eb7d99f809f611e9b7d400155d02670650ef3e7a68e54dbe8e60197297332488/ have de-DE and en-GB language codes in the TMX. If I'm sloppy and just provide de and en in the index, then I get:

Exception: Nothing for deu-eng in TMX ZipPath(root=PosixPath('/home/$USER/.mtdata/elrc-share.eu/4e4d/909f297fcad10f8094cd92c56016/ELRC_1086.zip'), name='de_bundeskanzlerin_A_deu-eng_reduced_stripped.tmx')

because the TMX parser does a strict match.

Actually it's worse than that: the file announces en-GB:

        <prop type="l1">de-DE</prop>
        <prop type="l2">en-GB</prop>

Then all the segments have en:

        <tu>
            <prop type="license"/>
            <tuv xml:lang="de-DE">
                <seg>Preisverleihung</seg>
            </tuv>
            <tuv xml:lang="en">
                <seg>Award presentation</seg>
            </tuv>
        </tu>
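A quick way to spot this kind of mismatch is to compare the announced languages with the `xml:lang` values actually used on the segments. This is a minimal sketch using only the standard library; the `<wrapper>` element and the helper names `declared_langs` and `segment_langs` are made up for illustration and are not part of mtdata or of the real file layout.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace, so ElementTree
# exposes it under this fully qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The <wrapper> element is an assumption; it only holds the
# fragments quoted in this issue together as one XML document.
TMX_SNIPPET = """
<wrapper>
  <prop type="l1">de-DE</prop>
  <prop type="l2">en-GB</prop>
  <tu>
    <prop type="license"/>
    <tuv xml:lang="de-DE"><seg>Preisverleihung</seg></tuv>
    <tuv xml:lang="en"><seg>Award presentation</seg></tuv>
  </tu>
</wrapper>
"""


def declared_langs(root):
    """Languages announced in the top-level l1/l2 props."""
    return [p.text for p in root.findall("prop") if p.get("type") in ("l1", "l2")]


def segment_langs(root):
    """Distinct xml:lang values that appear on <tuv> elements."""
    return sorted({tuv.get(XML_LANG) for tuv in root.iter("tuv")})


root = ET.fromstring(TMX_SNIPPET)
declared = declared_langs(root)
used = segment_langs(root)
print("declared:", declared)
print("used:    ", used)
```

For the fragment above, the announced pair includes en-GB while the segments only carry en, which is exactly the case a strict match rejects.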

Coding it as en and de_DE in the index seems to work, but that seems like overkill.

thammegowda commented 2 years ago

Sorry for the delayed response.

If I understand correctly, the problem here is that en and en-GB are not the same (i.e., not an exact match), but they are compatible. We should use a compatibility match instead of an exact match.

  1. The internal parser already converts all language IDs to BCP-47 objects while parsing TMX: https://github.com/thammegowda/mtdata/blob/ffa6c6736469fb428cd92e98b02d9f6fd45aa006/mtdata/tmx.py#L32

  2. The issue is that we then do an exact match: https://github.com/thammegowda/mtdata/blob/ffa6c6736469fb428cd92e98b02d9f6fd45aa006/mtdata/tmx.py#L65

Solution: we need to switch to a compatibility check (instead of an exact match), which is already implemented in BCP47Tag.are_compatible:

https://github.com/thammegowda/mtdata/blob/ffa6c6736469fb428cd92e98b02d9f6fd45aa006/mtdata/iso/bcp47.py#L66
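To illustrate what a compatibility match could look like, here is a minimal standalone sketch. It is not mtdata's code: `LangTag`, `parse_tag`, and `tags_compatible` are hypothetical names, the parser only handles simple language-script-region tags, and it does not normalize codes such as de to deu. The assumed rule is that two tags are compatible when the language matches and any subtag present on both sides agrees; the real BCP47Tag.are_compatible may behave differently.

```python
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    """Simplified language tag (hypothetical, not mtdata's BCP47Tag)."""
    lang: str
    script: Optional[str] = None
    region: Optional[str] = None


def parse_tag(tag: str) -> LangTag:
    """Parse tags like 'en', 'en-GB', 'de_DE', or 'sr-Latn-RS'."""
    parts = tag.replace("_", "-").split("-")
    lang = parts[0].lower()
    script = region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()   # four letters: script, e.g. Latn
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()   # two letters or three digits: region
    return LangTag(lang, script, region)


def tags_compatible(a: LangTag, b: LangTag) -> bool:
    """Assumed rule: a missing subtag acts as a wildcard."""
    if a.lang != b.lang:
        return False
    for x, y in ((a.script, b.script), (a.region, b.region)):
        if x is not None and y is not None and x != y:
            return False
    return True


# Exact match fails here, while the compatibility match succeeds.
print(parse_tag("en") == parse_tag("en-GB"))                    # False
print(tags_compatible(parse_tag("en"), parse_tag("en-GB")))     # True
print(tags_compatible(parse_tag("en-US"), parse_tag("en-GB")))  # False
```

Under this rule, an index entry of de and en would accept segments tagged de-DE and en, while en-US and en-GB would still be kept apart.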

thammegowda commented 2 years ago

I have improved language matching in https://github.com/thammegowda/mtdata/commit/c2e024d916f69ff95045900a32f0810803a15b53