neurlang / goruut

IPA Phonemizer for human languages
MIT License

Train more languages based on wikipron #4

Open neurlang opened 6 days ago

neurlang commented 6 days ago

The WikiPron TSV data is here:

https://github.com/CUNY-CL/wikipron/tree/master/data/scrape/tsv
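For anyone preparing training input from these files, here is a minimal sketch of a reader. It assumes the usual WikiPron layout of one entry per line, with the word and its pronunciation separated by a tab and the IPA segments separated by spaces. The function name `read_wikipron_tsv` and the two sample rows are made up for illustration; they are not part of goruut or taken from the dataset.

```python
import io
from typing import Iterator, TextIO


def read_wikipron_tsv(handle: TextIO) -> Iterator[tuple[str, list[str]]]:
    """Yield (word, segments) pairs from a WikiPron-style TSV stream.

    Assumed layout: "<word>\t<space-separated IPA segments>" per line.
    """
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word, sep, pron = line.partition("\t")
        if not sep or not pron.strip():
            continue  # skip lines without a pronunciation column
        yield word, pron.split()


# Hypothetical sample rows, only to show the shape of the data.
sample = io.StringIO("kat\tk a t\nhuis\tɦ œ y s\n\nbroken-line\n")
pairs = list(read_wikipron_tsv(sample))
for word, segments in pairs:
    print(word, segments)
```

With a real file, the same function can be fed `open(path, encoding="utf-8")` instead of the in-memory sample.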

neurlang commented 5 days ago

Hindi (hin_deva)
Bengali (ben_beng, ben_beng_dhaka, ben_beng_rarh)
Punjabi (pan_guru)
Telugu (tel_telu)
Marathi (mar_deva)
Tamil (tam_taml)
Gujarati (guj_gujr)
Urdu (urd_arab)
Turkish (tur_latn)
Vietnamese (vie_latn_hanoi, vie_latn_hue, vie_latn_saigon)
Polish (pol_latn)
Ukrainian (ukr_cyrl)
Greek (ell_grek)
Hungarian (hun_latn)
Malay (msa_latn, msa_arab)

neurlang commented 4 days ago

Afrikaans (afr) – Widely spoken in South Africa and Namibia.
Amharic (amh) – Official language of Ethiopia.
Azerbaijani (aze) – Spoken in Azerbaijan and parts of Iran.
Belarusian (bel) – Spoken primarily in Belarus.
Burmese (mya) – Official language of Myanmar.
Chechen (che) – Spoken in Chechnya, a part of Russia.
Cebuano (ceb) – Widely spoken in the Philippines.
Chuvash (chv) – Spoken in Russia (Chuvash Republic).
Danish (dan) – Official language of Denmark.
Dzongkha (dzo) – Official language of Bhutan.
Hausa (hau) – Spoken in Nigeria and surrounding countries.
Hebrew (heb) – Official language of Israel.
Indonesian (ind) – Official language of Indonesia.
Javanese (jav) – Spoken on the island of Java, Indonesia.
Kazakh (kaz) – Official language of Kazakhstan.
Korean (kor) – Official language of both North and South Korea.
Macedonian (mkd) – Official language of North Macedonia.
Malayalam (mal) – Spoken in the Indian state of Kerala.
Maltese (mlt) – Official language of Malta.
Mongolian (mon) – Official language of Mongolia.
Nepali (nep) – Official language of Nepal.
Pashto (pus) – Spoken in Afghanistan and Pakistan.
Somali (som) – Official language of Somalia.
Thai (tha) – Official language of Thailand.
Tibetan (bod) – Spoken in Tibet and neighboring areas.
Uyghur (uig) – Spoken by the Uyghur people in Xinjiang, China.
Zulu (zul) – Spoken in South Africa and surrounding areas.

neurlang commented 2 days ago

For the record, I will train these:

| lang | code | file |
| --- | --- | --- |
| afrikaans | afr | ./afr_latn_narrow.tsv |
| amharic | amh | ./amh_ethi_broad.tsv |
| azerbaijani | aze | ./aze_latn_narrow_filtered.tsv |
| belarusian | bel | ./bel_cyrl_narrow.tsv |
| burmese | mya | ./mya_mymr_broad_filtered.tsv |
| chechen | che | ./che_cyrl_broad.tsv |
| cebuano | ceb | ./ceb_latn_narrow.tsv |
| danish | dan | ./dan_latn_narrow.tsv |
| dzongkha | dzo | ./dzo_tibt_broad.tsv |
| hausa | hau | ./hau_latn_narrow.tsv |
| hebrew | heb | ./heb_hebr_narrow.tsv |
| indonesian | ind | ./ind_latn_narrow.tsv |
| javanese | jav | ./jav_java_broad.tsv |
| macedonian | mkd | ./mkd_cyrl_narrow.tsv |
| malayalam | mal | ./mal_mlym_narrow.tsv |
| maltese | mlt | ./mlt_latn_broad_filtered.tsv |
| mongolian | mon | ./mon_cyrl_narrow.tsv |
| nepali | nep | ./nep_deva_narrow.tsv |
| pashto | pus | ./pus_arab_broad.tsv |
| thai | tha | ./tha_thai_broad.tsv |
| tibetan | bod | ./bod_tibt_broad.tsv |
| uyghur | uig | ./uig_arab_broad.tsv |
| zulu | zul | ./zul_latn_broad.tsv |

| lang | code | no data |
| --- | --- | --- |
| somali | som | no data |
| chuvash | chv | no data |
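
The file names listed above appear to follow a `<code>_<script>_<broad|narrow>[_filtered].tsv` pattern. As a rough sketch, they can be split into their parts like this, which may help when picking between broad and narrow transcriptions per language. The pattern is inferred from the names in this thread, the optional dialect slot is an assumption based on codes such as `vie_latn_hanoi` mentioned earlier, and `WikipronFile` and `parse_wikipron_name` are hypothetical names, not goruut or WikiPron APIs.

```python
from typing import NamedTuple, Optional


class WikipronFile(NamedTuple):
    code: str               # language code, e.g. "afr"
    script: str             # script tag, e.g. "latn"
    dialect: Optional[str]  # assumed optional dialect part, e.g. "hanoi"
    level: str              # "broad" or "narrow"
    filtered: bool          # True when the name ends in "_filtered"


def parse_wikipron_name(path: str) -> WikipronFile:
    """Split a file name like './aze_latn_narrow_filtered.tsv' into its parts."""
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".tsv"):
        raise ValueError(f"not a .tsv file: {path}")
    parts = name[: -len(".tsv")].split("_")
    filtered = parts[-1] == "filtered"
    if filtered:
        parts = parts[:-1]
    if len(parts) < 3 or parts[-1] not in ("broad", "narrow"):
        raise ValueError(f"unexpected file name layout: {path}")
    # Anything between the script and the transcription level is treated
    # as a dialect tag (an assumption, see the note above).
    dialect = "_".join(parts[2:-1]) or None
    return WikipronFile(parts[0], parts[1], dialect, parts[-1], filtered)


for path in ("./afr_latn_narrow.tsv", "./aze_latn_narrow_filtered.tsv"):
    print(parse_wikipron_name(path))
```

Names that do not fit the assumed layout raise `ValueError` rather than being guessed at, so unexpected files surface early.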