mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Normalize punctuation marks in parallel data #769

Open gregtatum opened 3 months ago


NLLB data frequently has punctuation that differs between the source and target sentences.

| en | sl |
| --- | --- |
| Just like its predecessors, the Black Shark 3 is also expected to... | Tako kot Spiky Shark tudi novi Black Shark zagotavlja na.. |
| Interview preparation is exactl… | Spored aktivnosti med pripravami je natančno... |
| Swiftest and sweetest hours.” | najhitrejše in najslajše ure! |
| Stopping smoking will improve your health. | - Prenehanje kajenja bo izboljšalo vaše zdravje |
| "Why don't you write about horses? | »Zakaj ne pišete o motorjih!?« |
| · All relevant facts regarding the matter being grieved; | • vsa pomembna dejstva, ki se nanašajo na zadevo, |
| The Nokia 8 Sirocco will be available from end April in the UAE for an average retail price of AED 2399. | • Nokia 8 Sirocco bo na voljo v začetku aprila po globalni povprečni maloprodajni ceni od 749 eur. |
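
Two of the patterns in the pairs above (stray leading bullets or dashes, and inconsistent ellipses) could be handled roughly as in the sketch below. The function names and regexes are hypothetical and only illustrate the idea; they are not the ruleset proposed in this issue and not existing pipeline code.

```python
import re

# Hypothetical rules, for illustration only.
# A bullet or dash at the start of a line, followed by whitespace.
LEADING_BULLET = re.compile(r"^\s*[·•-]\s+")
# A run of two or more dots/ellipsis characters, or a single "…".
ELLIPSIS = re.compile(r"[.…]{2,}|…")


def strip_leading_bullet(text: str) -> str:
    """Remove one leading list marker such as '· ', '• ' or '- '."""
    return LEADING_BULLET.sub("", text, count=1)


def normalize_ellipsis(text: str) -> str:
    """Rewrite '..', '...', '…' and similar runs as three ASCII dots."""
    return ELLIPSIS.sub("...", text)


def normalize_pair(src: str, trg: str) -> tuple[str, str]:
    """Apply the same rules to both sides of a sentence pair."""
    src = normalize_ellipsis(strip_leading_bullet(src))
    trg = normalize_ellipsis(strip_leading_bullet(trg))
    return src, trg
```

Applying the same rules to both sides keeps the pair consistent even when only one side carries the stray mark. Whether a mark should be dropped or added to the other side instead is a design choice this sketch does not settle.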

Example Rules:

Locale data is available in Unicode's CLDR, which could be used to build the rulesets.

https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/es/characters.json

https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/sl/delimiters.json
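
As a rough sketch of how the CLDR data could feed a ruleset, the snippet below reads the quotation marks for a locale from an already-parsed `delimiters.json`. It assumes the file follows the `main.<locale>.delimiters` layout with `quotationStart` / `quotationEnd` keys; the helper name `quote_pairs` is hypothetical, and the sample values are placeholders rather than the real Slovenian data.

```python
import json


def quote_pairs(cldr_delimiters: dict, locale: str) -> dict:
    """Return the primary and alternate quotation marks for a locale.

    Assumes the cldr-json layout: main -> <locale> -> delimiters.
    """
    d = cldr_delimiters["main"][locale]["delimiters"]
    return {
        "primary": (d["quotationStart"], d["quotationEnd"]),
        "alternate": (d["alternateQuotationStart"], d["alternateQuotationEnd"]),
    }


# Placeholder document in the assumed shape; "<<", ">>", "<" and ">" are
# stand-ins, not the actual CLDR values for "sl".
sample = json.loads("""
{
  "main": {
    "sl": {
      "delimiters": {
        "quotationStart": "<<",
        "quotationEnd": ">>",
        "alternateQuotationStart": "<",
        "alternateQuotationEnd": ">"
      }
    }
  }
}
""")

pairs = quote_pairs(sample, "sl")
```

In practice the dictionary would come from `json.load` on the downloaded file for each language in the pair, and the resulting marks could drive a per-locale quote normalization rule.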