Question: About new languages

loretoparisi commented 6 years ago

Hello, thank you for this repository and its parallel corpus of TED talk languages! I'm interested in specific languages (such as Hindi or Urdu), but I'm not sure what the process is for retrieving these datasets. I'm aware of the transcript and translation processes as described here and on the WIT^3 web site here, but assuming I have a specific talk like this one in Tamil, how did you automatically get the translations and transcripts?

My aim is to augment your dataset with more of the missing languages. Specifically, I'm looking for translations of talks in the following languages; for each one I have checked the number of talks currently available:

Language	Code	Talks
Urdu	urd	146 talks
Malayalam	mal	43 talks
Hindi	hin	417 talks
Assamese	asm	1 talk
Bengali	ben	111 talks
Gujarati	guj	36 talks
Kannada	kan	14 talks
Marathi	mar	184 talks
Nepali	nep	43 talks
Punjabi	pan	9 talks
Tamil	tam	114 talks
Telugu	tel	59 talks

Japanese	jpn	2565 talks
Chinese, Simplified	zh-Hans	2597 talks
Chinese, Traditional	zh-Hant / zho	2765 talks

Thank you very much.

DevSinghSachan commented 6 years ago

Hi Loreto, in each of the train, test, and dev dataset files, each column corresponds to one of the languages. Each row contains the transcripts of the talks at the sentence level. The names of the corresponding TED talks should be present in the first column. Let me know if you have further questions.
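The layout described above (one column per language, one row per sentence, talk name in the first column) can be read with a few lines of Python. This is only a minimal sketch: it assumes the files are tab-separated with a header row whose first column is `talk_name`, and that a missing translation is marked with a placeholder such as `__NULL__`. The helper name `extract_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def extract_pairs(tsv_file, src, tgt, missing="__NULL__"):
    """Yield (talk_name, src_sentence, tgt_sentence) for rows that have both languages."""
    # QUOTE_NONE keeps quotation marks inside sentences from being treated as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        s = (row.get(src) or "").strip()
        t = (row.get(tgt) or "").strip()
        # Skip rows where either side is empty or carries the assumed placeholder.
        if s and t and s != missing and t != missing:
            yield row["talk_name"], s, t


# Tiny made-up sample standing in for a real train/dev/test file.
SAMPLE = (
    "talk_name\ten\thi\tja\n"
    "talk_a\tHello.\t(Hindi sentence 1)\t(Japanese sentence 1)\n"
    "talk_a\tThank you.\t__NULL__\t(Japanese sentence 2)\n"
)

en_ja = list(extract_pairs(io.StringIO(SAMPLE), "en", "ja"))
en_hi = list(extract_pairs(io.StringIO(SAMPLE), "en", "hi"))
print(len(en_ja), len(en_hi))
```

To use it on a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) in place of the `io.StringIO` sample.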

loretoparisi commented 6 years ago

@DevSinghSachan thank you for the info about the dataset. Is it possible to add a specific new language, say hi or ja? Thanks again.

DevSinghSachan commented 6 years ago

@loretoparisi I think both the "hi" and "ja" languages should be present in this dataset, but adding new languages in the current format of the dataset may not be easy, as we have not retained the time-id information from the TED talk transcripts, which is crucial for proper sentence alignment.
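To illustrate why the time ids matter, here is a rough sketch of alignment keyed on them. It assumes a hypothetical caption format of `(time_id, text)` pairs per language; this is not the format or the method the dataset authors used, and the function name `align_by_time_id` is invented for the example.

```python
def align_by_time_id(src_captions, tgt_captions):
    """Pair captions from two languages that share the same time id."""
    tgt_by_time = dict(tgt_captions)
    return [
        (text, tgt_by_time[time_id])
        for time_id, text in src_captions
        if time_id in tgt_by_time  # captions with no counterpart are dropped
    ]


# Hypothetical captions: (time id in milliseconds, caption text).
en = [(0, "Hello."), (1500, "Thank you."), (3000, "Goodbye.")]
ta = [(0, "<Tamil caption 1>"), (3000, "<Tamil caption 3>")]

aligned = align_by_time_id(en, ta)
print(aligned)
```

Without the shared time ids there is no such key to join on, so rows for a new language could not be matched to the existing sentences this simply.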

loretoparisi commented 6 years ago

@DevSinghSachan okay, thank you. I have checked the currently supported languages from the header of the file:

talk_name   en  es  pt-br   fr  ru  he  ar  ko  zh-cn   it  ja  zh-tw   nl  ro  tr  de  vi  pl  pt  bg  el  fa  sr  hu  hr  uk  cs  id  th  sv  sk  sq  lt  da  calv    my  sl  mk  fr-ca   fi  hy  hi  nb  ka  mn  et  ku  gl  mr  zh  ur  eo  ms  az  ta  bn  kk  be  eu  bs
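A quick way to compare this header with the languages requested at the top of the thread is sketched below. The mapping from the three-letter codes in the request to the two-letter codes in the header is an assumption (the header appears to use ISO 639-1 style codes), as is treating `zh-cn` and `zh-tw` as Simplified and Traditional Chinese; the function name is made up.

```python
HEADER = (
    "talk_name en es pt-br fr ru he ar ko zh-cn it ja zh-tw nl ro tr de vi pl pt "
    "bg el fa sr hu hr uk cs id th sv sk sq lt da calv my sl mk fr-ca fi hy hi nb "
    "ka mn et ku gl mr zh ur eo ms az ta bn kk be eu bs"
)

# Requested code -> assumed column name in the header.
REQUESTED = {
    "urd": "ur", "mal": "ml", "hin": "hi", "asm": "as", "ben": "bn",
    "guj": "gu", "kan": "kn", "mar": "mr", "nep": "ne", "pan": "pa",
    "tam": "ta", "tel": "te", "jpn": "ja",
    "zh-Hans": "zh-cn", "zh-Hant": "zh-tw",
}


def split_by_availability(header, requested):
    """Return (present, missing) lists of requested codes, given a header line."""
    columns = set(header.split()[1:])  # drop the leading talk_name column
    present = [code for code, col in requested.items() if col in columns]
    missing = [code for code, col in requested.items() if col not in columns]
    return present, missing


present, missing = split_by_availability(HEADER, REQUESTED)
print("present:", present)
print("missing:", missing)
```

Under that mapping, ur, hi, bn, mr, ta, ja, zh-cn and zh-tw appear in the header, while ml, as, gu, kn, ne, pa and te do not.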

Thanks, OK, closing.

neulab / word-embeddings-for-nmt

Question: About new languages #1