thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

JW300 v1c #70

Closed kpu closed 3 years ago

kpu commented 3 years ago

https://opus.nlpl.eu/JW300.php

"Version 1c provides proper raw untokenized texts and also fixes some additional problems with language codes. "

thammegowda commented 3 years ago

Thanks! This will be added in the next version.

We just need to change the URLs here https://github.com/thammegowda/mtdata/blob/8cc7f5ba034e6d30eceb9ab37437f4ba0104368e/mtdata/index/opus/jw300.py#L415-L416 from v1 to v1c.

thammegowda commented 3 years ago

@kpu and @jorgtied I tried to use v1c, but I am still getting tokenized text instead of the expected raw/untokenized text.

It's the same with opustools (I still get tokenized data):

$ pip install opustools
$ opus_read -d JW300 -s af -t bg -wm moses -r v1c -w jw300.af jw300.bg

$ head -3 jw300.af jw300.bg
==> jw300.af <==
Gesinsbeplanning — die Christelike beskouing
BY DIE eerste Wêreldbevolkingskonferensie in 1974 het die 140 nasies wat vergader het , besluit dat alle egpare “ die basiese reg het om vryelik en op ’ n verantwoordelike wyse te besluit oor die aantal en die spasiëring van hulle kinders en om die inligting , opvoeding en middele te hê om dit te doen ” .
Baie beskou dit as ’ n goeie besluit .

==> jw300.bg <==
Планиране на семейството — християнският възглед
СТО и четиридесетте страни , които участвуваха в първата Световна конференция по въпросите на населението , проведена през 1974 г . , решиха , че всяко семейство „ има основното право да решава свободно и отговорно относно броя на децата си , а също и разликата помежду им , и да има информацията , образованието и средствата да прави това “ .
Много хора смятат , че това решение е добро .

I manually inspected this v1c file: https://opus.nlpl.eu/download.php?f=JW300/v1c/xml/en.zip
It has tokenized words:

<?xml version="1.0" encoding="utf-8"?>
<text>
<s id="1">
    <w id="1.1">“</w>
    <w id="1.2">A</w>
    <w id="1.3">Good</w>
    <w id="1.4">Word</w>
    <w id="1.5">for</w>
    <w id="1.6">the</w>
    <w id="1.7">Witnesses</w>
    <w id="1.8">”</w>
</s>
<s id="2">
    <w id="2.1">THE</w>
    <w id="2.2">preaching</w>
    <w id="2.3">activity</w>
    <w id="2.4">of</w>
    <w id="2.5">Jehovah</w>
    <w id="2.6">’</w>
    <w id="2.7">s</w>
    <w id="2.8">witnesses</w>
    <w id="2.9">is</w>
    <w id="2.10">growing</w>
    <w id="2.11">very</w>
    <w id="2.12">rapidly</w>
    <w id="2.13">.</w>
</s>

I can't figure out what I'm missing to get raw, untokenized text.
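
For anyone who wants to repeat this check without reading the XML by eye, here is a minimal Python sketch. It assumes the file follows the `<s>`/`<w>` layout shown in the excerpt above; the sample is a shortened copy of that excerpt with a closing `</text>` added so it parses, and `tokens_per_sentence` is a hypothetical helper name, not part of mtdata or opustools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above; the closing </text> is added so it parses.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<text>
<s id="1">
    <w id="1.1">“</w>
    <w id="1.2">A</w>
    <w id="1.3">Good</w>
    <w id="1.4">Word</w>
    <w id="1.5">for</w>
    <w id="1.6">the</w>
    <w id="1.7">Witnesses</w>
    <w id="1.8">”</w>
</s>
</text>
"""


def tokens_per_sentence(xml_text):
    """Return one list of <w> token strings per <s> element (hypothetical helper)."""
    # Parse from bytes so the encoding declaration in the XML header is accepted.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [[w.text for w in s.iter("w")] for s in root.iter("s")]


sents = tokens_per_sentence(SAMPLE)
# If any sentence carries <w> children, the text is stored word by word.
is_tokenized = any(tokens for tokens in sents)
print(is_tokenized)
print(" ".join(sents[0]))
```

Joining the `<w>` contents with spaces reproduces the detached quotes and punctuation seen in the opus_read output above.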

jorgtied commented 3 years ago

Try adding -p raw to the command. That should give you the untokenized text. Jörg
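
As a sketch of what that looks like when scripted, the Python snippet below rebuilds the command from earlier in this thread with `-p raw` added. It only assembles and prints the argument list; it does not run opus_read. The function name `opus_read_command` is hypothetical, and the flags are copied from the commands in this thread rather than checked against every opustools version.

```python
import shlex


def opus_read_command(src, tgt, release="v1c", preprocess="raw"):
    """Build the opus_read argument list from this thread, plus -p (hypothetical helper)."""
    return [
        "opus_read",
        "-d", "JW300",       # corpus name
        "-s", src,           # source language
        "-t", tgt,           # target language
        "-wm", "moses",      # write mode used in the original command
        "-r", release,       # corpus release
        "-p", preprocess,    # the flag suggested above
        "-w", f"jw300.{src}", f"jw300.{tgt}",  # output files
    ]


cmd = opus_read_command("af", "bg")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once opustools is installed.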

thammegowda commented 3 years ago

@jorgtied Works now. Thanks for the quick reply.