mozilla / firefox-translations

Firefox Translations is a webextension that enables client side translations for web browsers.
Mozilla Public License 2.0

Please add support for Catalan language #602

Closed jordimas closed 1 year ago

jordimas commented 1 year ago

Hello

Please consider adding support for the Catalan language.

In this repository you have a large collection of open-source aligned parallel corpora that you can use to train your system:

https://github.com/Softcatala/parallel-catalan-corpus

If you need more help finding datasets, please let us know and we can help out.

andrenatal commented 1 year ago

Thanks @jordimas, this was helpful. I'll definitely reach out about it.

andrenatal commented 1 year ago

Hello @jordimas, do you have cleaned monolingual datasets in Catalan that we could use?

andrenatal commented 1 year ago

I suppose we can use Common Voice's? There are 1161772 sentences there.

jordimas commented 1 year ago

My understanding is that you want monolingual datasets to do back translation. In that case, ideally the texts should not be part of the parallel corpus. On top of Common Voice, which you mention, these corpora can also be helpful:

Let me know if you need more help
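
To illustrate the point about keeping the monolingual texts out of the parallel corpus, here is a minimal sketch of such a filter. It is only an illustration, not something the project is known to use: the names `normalize` and `filter_monolingual` are hypothetical, and treating two sentences as "the same" when they match after lowercasing and collapsing whitespace is an assumption of the example.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def filter_monolingual(monolingual, parallel_side):
    """Keep monolingual sentences that do not occur in the parallel corpus.

    monolingual:   iterable of sentences intended for back translation.
    parallel_side: iterable of sentences from the same-language side of the
                   parallel corpus (hypothetical input for this sketch).
    """
    seen = {normalize(s) for s in parallel_side}
    kept = []
    for sentence in monolingual:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)  # also drops repeats inside the monolingual data
            kept.append(sentence)
    return kept
```

Exact matching like this only catches verbatim overlap; near-duplicates would need a fuzzier comparison.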

andrenatal commented 1 year ago

Thanks Jordi.


andrenatal commented 1 year ago

Hi @jordimas, we've trained a Catalan-to-English model using your corpora and merged support for it. The Nightly version of the extension containing it will be available to test tomorrow morning.

The model was also incorporated into the translations website: https://mozilla.github.io/translate/

Please test it and let us know what you think when you can.

Gràcies!

jordimas commented 1 year ago

Thanks for your work @andrenatal

The nightly link provided at https://github.com/mozilla/firefox-translations#nightly-builds under the text "Then install the extension by clicking here Firefox Translations - Install Nightly" gives a 404. However, I was able to test the web version.

It works reasonably well. My suggestion is that you compute the BLEU metrics against Flores200 or any other reference corpus.
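
For anyone unfamiliar with the metric, here is a rough sketch of how a corpus-level BLEU score is computed. It is a simplified, unsmoothed version with one reference per sentence and plain whitespace tokenization (an assumption of this sketch), so its numbers will not match those of standard evaluation tools. The names `ngrams` and `corpus_bleu` are made up for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU in [0, 1], one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngrams(hyp_tok, n)
            ref_ngrams = ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Because the comparison is case-sensitive and unsmoothed, a single sentence with no matching bigram scores 0, which is why BLEU is normally reported over a whole test set.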

The biggest issue that I found is that it does not translate upper case sentences properly.

It's very easy to reproduce. For example:

LA FUNDACIÓ MOZILLA ÉS UNA ORGANITZACIÓ SENSE ÀNIM DE LUCRE (ca) -> The FOUNDATION FOUNDATION IS A ORGANIZATION OF LUCRE (en) (which is a broken translation)

The same text in lower case is properly translated:

La Fundació Mozilla és una organització sense ànim de lucre (ca) -> The Mozilla Foundation is a non-profit organization

A wild guess: either you are not applying corpus augmentation for upper case during training, or your TrueCase mechanism is not working properly.
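
By upper case augmentation I mean something along these lines. This is only a sketch of the general idea, not a description of how the training pipeline works: the function name `augment_with_upper_case`, the `ratio` default, and the choice to upper-case both sides of a pair are assumptions made for illustration.

```python
import random


def augment_with_upper_case(pairs, ratio=0.1, seed=0):
    """Return the original pairs plus upper-cased copies of a random sample.

    pairs: list of (source, target) sentence tuples.
    ratio: probability that a pair also gets an upper-cased copy
           (hypothetical default, not a recommended value).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    augmented = list(pairs)
    for source, target in pairs:
        if rng.random() < ratio:
            augmented.append((source.upper(), target.upper()))
    return augmented
```

With this kind of augmentation the model sees some all-caps examples during training instead of meeting them for the first time at inference.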

Thanks again

andrenatal commented 1 year ago

Hi Jordi.

We have BLEU reports posted here https://github.com/mozilla/firefox-translations-models/blob/main/evaluation/dev/bleu-results.md#ca-en and COMET reports here https://github.com/mozilla/firefox-translations-models/blob/main/evaluation/dev/comet-results.md#ca-en

I will look into the Nightly situation and your other comments in the morning and let you know.

Thank you


andrenatal commented 1 year ago

The Nightly extension is now available and not 404ing anymore @jordimas


marco-c commented 1 year ago

We have an issue open around the problem of ALL CAPS: https://github.com/mozilla/firefox-translations-training/issues/73. I'll close this since Catalan is now supported.