Closed jordimas closed 1 year ago
Thanks @jordimas , this was helpful. I'll definitely reach out about it.
Hello @jordimas, do you have cleaned monolingual datasets in Catalan that we could use?
I suppose we can use Common Voice's? There are 1,161,772 sentences there.
Thanks Jordi.
On Wed, Feb 15, 2023, 5:35 AM Jordi Mas @.***> wrote:
My understanding is that you want monolingual datasets for back-translation. In that case, the texts should ideally not be part of the parallel corpus. In addition to Common Voice, which you mention, these corpora can also be helpful:
- We provide 18M strings in Catalan: https://github.com/Softcatala/parallel-catalan-corpus#catalan-monolingual-corpus
- ParaCrawl (https://paracrawl.eu/v8) has a Spanish-to-Catalan dataset that provides a good reference covering different domains for the Catalan pairs
- The Catalan Textual Corpus (https://huggingface.co/datasets/projecte-aina/catalan_textual_corpus) provides 73M sentences
Let me know if you need more help
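To illustrate the point about keeping back-translation data separate from the parallel corpus, here is a minimal sketch of an overlap filter. The function names (`normalize`, `filter_monolingual`) and the normalization rule (lowercase plus whitespace collapsing) are assumptions made for this example, not part of any Mozilla or Softcatalà tooling.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so near-identical lines compare equal."""
    return " ".join(sentence.lower().split())


def filter_monolingual(monolingual, parallel_side):
    """Keep monolingual sentences that do not appear in the parallel corpus.

    `parallel_side` is the Catalan side of the parallel data (hypothetical input).
    Duplicates inside the monolingual data are dropped as well.
    """
    seen = {normalize(s) for s in parallel_side}
    kept = []
    for sentence in monolingual:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)  # also deduplicates the monolingual side
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    mono = ["Bon dia", "bon  dia", "Bona nit", "Hola món"]
    parallel_ca = ["Bona nit"]
    print(filter_monolingual(mono, parallel_ca))
```

For corpora with tens of millions of lines, an in-memory set may be too large; hashing the normalized lines or sorting and merging on disk would be a reasonable alternative.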
Hi @jordimas, we've trained a Catalan-to-English model using your corpora and merged support for it. The Nightly version of the extension containing it will be available to test tomorrow morning.
The model was also incorporated in the translations website: https://mozilla.github.io/translate/
Please test it and let us know what you think when you can.
Gràcies!
Thanks for your work @andrenatal
The Nightly link provided at https://github.com/mozilla/firefox-translations#nightly-builds under the text "Then install the extension by clicking here Firefox Translations - Install Nightly" gives a 404. However, I was able to test the web version.
It works reasonably well. My suggestion is that you compute BLEU against FLORES-200 or another reference corpus.
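As a rough illustration of what a BLEU computation involves, here is a simplified corpus-level sketch. It assumes whitespace tokenization, one reference per sentence, and no smoothing, so its numbers are not comparable to published scores; for reportable results a standard tool such as sacreBLEU would normally be used. The function names are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): clipped n-gram precision with brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection clips each n-gram count by the reference count.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty precision gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["The Mozilla Foundation is a non-profit organization"]
    print(corpus_bleu(refs, refs))
    print(corpus_bleu(["The FOUNDATION FOUNDATION IS A ORGANIZATION OF LUCRE"], refs))
```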
The biggest issue I found is that it does not translate upper-case sentences.
It's very easy to reproduce. For example:
LA FUNDACIÓ MOZILLA ÉS UNA ORGANITZACIÓ SENSE ÀNIM DE LUCRE (ca) -> The FOUNDATION FOUNDATION IS A ORGANIZATION OF LUCRE (en) (which is a broken translation)
The same text in lower case is properly translated:
La Fundació Mozilla és una organització sense ànim de lucre (ca) -> The Mozilla Foundation is a non-profit organization
A wild guess: you are not applying corpus augmentation for upper case during training, or your truecasing mechanism is not working properly.
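To make the augmentation idea concrete, here is a minimal sketch that adds upper-cased copies of a random fraction of sentence pairs to the training data. The function name, the `ratio` default, and the choice to upper-case both sides are assumptions for illustration; this is not how the firefox-translations-training pipeline is known to work.

```python
import random


def augment_with_uppercase(pairs, ratio=0.1, seed=0):
    """Return the original (source, target) pairs plus upper-cased copies.

    Each pair is duplicated in upper case with probability `ratio`
    (a hypothetical default; the right value would need tuning).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    augmented = list(pairs)
    for source, target in pairs:
        if rng.random() < ratio:
            augmented.append((source.upper(), target.upper()))
    return augmented


if __name__ == "__main__":
    pairs = [(
        "La Fundació Mozilla és una organització sense ànim de lucre",
        "The Mozilla Foundation is a non-profit organization",
    )]
    for source, target in augment_with_uppercase(pairs, ratio=1.0):
        print(source, "->", target)
```

Upper-casing both sides teaches the model to mirror the casing of the input; upper-casing only the source would instead teach it to normalize casing in the output.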
Thanks again
Hi Jordi.
We have BLEU reports posted here https://github.com/mozilla/firefox-translations-models/blob/main/evaluation/dev/bleu-results.md#ca-en and COMET reports here https://github.com/mozilla/firefox-translations-models/blob/main/evaluation/dev/comet-results.md#ca-en
I will look into the Nightly issue and your other comments by the morning and let you know.
Thank you
The Nightly extension is now available and not 404ing anymore @jordimas
We have an issue open around the problem of ALL CAPS: https://github.com/mozilla/firefox-translations-training/issues/73. I'll close this since Catalan is now supported.
Hello
Please consider adding Catalan language.
In this repository you have a large collection of open source aligned parallel corpora that you can use to train your system:
https://github.com/Softcatala/parallel-catalan-corpus
If you need more help finding datasets, please let us know and we can help out.