loretoparisi closed this issue 6 years ago
Hi Loreto, in each of the train, test, and dev dataset files, each column corresponds to one language and each row contains sentence-level transcripts of the talks. The name of the corresponding TED talk should be present in the first column. Let me know if you have further questions.
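To make the column layout concrete, here is a minimal sketch of pulling one language pair out of such a file. It assumes the file is tab-separated with a header row whose first column is `talk_name`, and that untranslated cells hold a placeholder such as `__NULL__`; the function name `read_parallel_pairs`, the `missing` parameter, and the sample rows are illustrative only and are not taken from the repository.

```python
import csv
import io


def read_parallel_pairs(handle, src, tgt, missing="__NULL__"):
    """Yield (talk_name, src_sentence, tgt_sentence) for rows that have both sides.

    Assumptions (not confirmed by the thread): tab-separated columns, a header
    row naming each language, and `missing` as the marker for absent cells.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    for lang in (src, tgt):
        if lang not in header:
            raise ValueError(f"language column {lang!r} not found in header")
    for row in reader:
        s = (row.get(src) or "").strip()
        t = (row.get(tgt) or "").strip()
        # Skip rows where either side is empty or marked as missing.
        if not s or not t or missing in (s, t):
            continue
        yield row["talk_name"], s, t


# Made-up sample data in the assumed layout (one column per language).
sample = (
    "talk_name\ten\thi\tja\n"
    "talk_a\tHello.\tNamaste.\tKonnichiwa.\n"
    "talk_a\tThank you.\t__NULL__\tArigatou.\n"
)

en_hi = list(read_parallel_pairs(io.StringIO(sample), "en", "hi"))
en_ja = list(read_parallel_pairs(io.StringIO(sample), "en", "ja"))
```

With a real file, the same function could be called as `read_parallel_pairs(open(path, encoding="utf-8", newline=""), "en", "hi")`, where `path` points at one of the train, test, or dev files.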
@DevSinghSachan thank you for the info about the dataset. Is it possible to add a specific new language, say "hi" or "ja"? Thanks again.
@loretoparisi I think both "hi" and "ja" languages should be present in this dataset, but adding new languages in the current format of the dataset may not be easy, as we have not retained the time-id information from the TED talks transcripts, which is crucial in proper sentence alignment.
@DevSinghSachan okay, thank you. I have checked the currently supported languages in the header of the file:
talk_name en es pt-br fr ru he ar ko zh-cn it ja zh-tw nl ro tr de vi pl pt bg el fa sr hu hr uk cs id th sv sk sq lt da calv my sl mk fr-ca fi hy hi nb ka mn et ku gl mr zh ur eo ms az ta bn kk be eu bs
Thanks, OK, closing.
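The header check described above can also be scripted. The following is a small sketch, not code from the repository: the helper name `languages_in_header` and the list of wanted codes are hypothetical, and the header string is copied from the line quoted in this thread.

```python
def languages_in_header(header_line):
    """Return the language codes from a header line, skipping the talk_name column."""
    columns = header_line.split()  # handles both tab- and space-separated headers
    return columns[1:]


# Header as quoted in this thread.
header = (
    "talk_name en es pt-br fr ru he ar ko zh-cn it ja zh-tw nl ro tr de vi pl "
    "pt bg el fa sr hu hr uk cs id th sv sk sq lt da calv my sl mk fr-ca fi hy "
    "hi nb ka mn et ku gl mr zh ur eo ms az ta bn kk be eu bs"
)

langs = languages_in_header(header)

# Hypothetical list of codes someone might want to check.
wanted = ["hi", "ja", "ur", "ta", "sw"]
absent = [code for code in wanted if code not in langs]
```

Here `absent` collects any wanted codes that have no column in the header, which are the languages that would actually need to be added.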
Hello, thank you for this repository of the TED talks multilingual parallel corpus! I'm interested in specific languages (such as Hindi or Urdu), but I'm not sure what the process is for retrieving these datasets. I'm aware of the transcript and translation processes as described here and on the WIT^3 web site here, but assuming I have a specific talk like this one in Tamil, how did you automatically get the translations and transcripts?
My aim is to augment your dataset with more of the missing languages. Specifically, I'm looking for translations of the following talks, for which I have checked the number of talks currently available:
Thank you very much.