Dataset and/or extraction script

oussamaahmia / TED-dataset

MIT License



Closed by edugonza 5 years ago

edugonza commented 5 years ago

Hi!

I came across your paper (https://www.aclweb.org/anthology/L18-1583) and I found the link to this repository. I think that this dataset could be very useful for NLP analysis. I was wondering if you have plans to publish the full dataset and/or the scripts that you used to extract the data from the official API. It would be of great help.

Thanks a lot.

Best regards,
Edu

oussamaahmia commented 5 years ago

Hi @edugonza, the full dataset will be available no later than October 2019 (a Python API will also be provided for easier manipulation). Concerning the extraction script, I need to discuss with the company (OctopusMind) which parts of the code I can publish, and then change the script accordingly.

Best regards,
Oussama Ahmia

edugonza commented 5 years ago

Hi @oussamaahmia, that is great. I am looking forward to experimenting with that data. Thanks a lot.

Best regards,
Edu

AnifInnab commented 5 years ago

Hello @oussamaahmia,

October is coming to an end soon. Is the data being released as scheduled? :)

Thanks!

oussamaahmia commented 5 years ago

Hello @AnifInnab, I wish to inform you that we had some technical issues that will cause some delay. For now, I am uploading part of the dataset, including par-TED (a corpus of sentences translated into 24 languages) and fd-TED in French and English versions (a text classification dataset). I will post a download link as soon as the upload is complete (by tomorrow). The rest of the dataset will follow as soon as it is processed (including new documents released between 2018 and 2019). I remain available for any further information you may need.

AnifInnab commented 5 years ago

Amazing, many thanks! Since we're working in Swedish, the corpus will be sufficient for now. I'll get back to you if any questions come to mind.

Much appreciated!