neulab / word-embeddings-for-nmt

Supplementary material for "When and Why Are Pre-trained Word Embeddings Useful for Neural Machine Translation?" at NAACL 2018

Question: About new languages #1

Closed loretoparisi closed 6 years ago

loretoparisi commented 6 years ago

Hello, thank you for this repository of parallel corpora from TED talks! I'm interested in specific languages (such as Hindi or Urdu), but I'm not sure how to retrieve these datasets. I'm aware of the transcription and translation processes as described here and on the WIT^3 website here, but assuming I have a specific talk like this one in Tamil, how did you automatically get the translations/transcripts?

My aim is to augment your dataset with more of the missing languages. Specifically, I'm looking for translations of talks in the following languages; I have checked the number of talks currently available for each:

Language              Code             Talks
Urdu                  urd              146
Malayalam             mal              43
Hindi                 hin              417
Assamese              asm              1
Bengali               ben              111
Gujarati              guj              36
Kannada               kan              14
Marathi               mar              184
Nepali                nep              43
Punjabi               pan              9
Tamil                 tam              114
Telugu                tel              59

Japanese              jpn              2565
Chinese, Simplified   zh-Hans          2597
Chinese, Traditional  zh-Hant / zho    2765

Thank you very much.

DevSinghSachan commented 6 years ago

Hi Loreto, in each of the train, test, and dev dataset files, each column corresponds to one of the languages. Each row contains the talk transcripts at the sentence level. The name of the corresponding TED talk should be present in the first column. Let me know in case you have further questions.
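As a rough illustration of that layout, the sketch below reads one language pair out of such a file. It assumes the files are tab-separated with a header row whose first column is `talk_name`, and that a missing translation appears as an empty cell; if the files use a placeholder token instead, filter on that. The function name `read_pairs` and the sample rows are hypothetical.

```python
import csv
import io

def read_pairs(tsv_file, src, tgt):
    """Yield (talk_name, src_sentence, tgt_sentence) for rows where both sides are non-empty."""
    # QUOTE_NONE: treat quote characters in transcripts as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        s = (row.get(src) or "").strip()
        t = (row.get(tgt) or "").strip()
        if s and t:
            yield row["talk_name"], s, t

# Hypothetical two-sentence sample in the assumed layout (not real dataset rows).
SAMPLE = (
    "talk_name\ten\thi\tja\n"
    "talk_a\tHello.\tनमस्ते।\tこんにちは。\n"
    "talk_a\tThank you.\t\tありがとう。\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE), "en", "ja"))
for talk, en, ja in pairs:
    print(talk, en, ja, sep=" | ")
```

With a real file, the same function could be called on `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.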

loretoparisi commented 6 years ago

@DevSinghSachan thank you for the info about the dataset. Is it possible to add a specific new language, say hi or ja? Thanks again.

DevSinghSachan commented 6 years ago

@loretoparisi I think both the "hi" and "ja" languages should be present in this dataset. However, adding new languages to the dataset in its current format may not be easy: we have not retained the time-id information from the TED talk transcripts, which is crucial for proper sentence alignment.

loretoparisi commented 6 years ago

@DevSinghSachan okay, thank you. I have checked the currently supported languages from the file header:

talk_name   en  es  pt-br   fr  ru  he  ar  ko  zh-cn   it  ja  zh-tw   nl  ro  tr  de  vi  pl  pt  bg  el  fa  sr  hu  hr  uk  cs  id  th  sv  sk  sq  lt  da  calv    my  sl  mk  fr-ca   fi  hy  hi  nb  ka  mn  et  ku  gl  mr  zh  ur  eo  ms  az  ta  bn  kk  be  eu  bs
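A quick way to compare this header against the languages requested at the top of the thread is sketched below. The header string is copied from the line above; the mapping from three-letter codes to two-letter codes is my own assumption based on ISO 639, since the thread itself does not give it, and the variable names are illustrative.

```python
# Header row as quoted in the thread (whitespace-separated column names).
HEADER = (
    "talk_name en es pt-br fr ru he ar ko zh-cn it ja zh-tw nl ro tr de vi pl pt "
    "bg el fa sr hu hr uk cs id th sv sk sq lt da calv my sl mk fr-ca fi hy hi nb "
    "ka mn et ku gl mr zh ur eo ms az ta bn kk be eu bs"
)

# Assumed mapping: three-letter codes from the request to two-letter ISO 639-1 codes.
WANTED = {
    "urd": "ur", "mal": "ml", "hin": "hi", "asm": "as", "ben": "bn",
    "guj": "gu", "kan": "kn", "mar": "mr", "nep": "ne", "pan": "pa",
    "tam": "ta", "tel": "te", "jpn": "ja",
}

# Drop the first column (talk_name); the rest are language columns.
columns = set(HEADER.split()[1:])

present = [code for code in WANTED.values() if code in columns]
missing = [code for code in WANTED.values() if code not in columns]

print("present:", present)
print("missing:", missing)
```

The two Chinese variants are left out of the mapping because the header names them `zh-cn` and `zh-tw` rather than `zh-Hans` and `zh-Hant`, so they would need to be matched by hand.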

Thanks, OK, closing.