TED-dataset

The two sub-datasets, fd-TED and par-TED, will be updated on a regular basis to keep track of the new calls for tender published by EU member states.

The par-TED corpus is a multilingual (24 languages) aligned corpus: a set of unique parallel sentences, each translated into at least 23 languages.
The fd-TED corpus is built from the full content of the documents extracted from the Tenders Electronic Daily (TED) platform. This dataset can be used as a benchmark for supervised classification or for training machine learning models applied to business intelligence applications. We also provide a filtered version of fd-TED, created by ignoring administrative information.
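
As a rough illustration of how an aligned corpus like par-TED might be consumed for machine translation, the sketch below pulls bilingual sentence pairs out of multilingual entries. It assumes each entry has already been loaded into a dictionary keyed by language code; that layout, the `extract_pairs` helper, and the toy sentences are all hypothetical and do not describe the actual file format of the corpus.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed in-memory form of one aligned entry (not the corpus file format):
# a mapping from language code to the sentence in that language.
AlignedSentence = Dict[str, str]


def extract_pairs(
    entries: Iterable[AlignedSentence], src: str, tgt: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs for entries that contain both languages."""
    for entry in entries:
        # An entry may lack some languages, so skip it unless both are present.
        if src in entry and tgt in entry:
            yield entry[src], entry[tgt]


# Toy data invented for illustration; not taken from the corpus.
sample = [
    {
        "en": "Supply of office furniture.",
        "fr": "Fourniture de mobilier de bureau.",
        "de": "Lieferung von Büromöbeln.",
    },
    {
        "en": "Road maintenance services.",
        "de": "Straßeninstandhaltungsdienste.",
    },
]

pairs = list(extract_pairs(sample, "en", "fr"))
```

Because not every sentence is necessarily available in every language, the helper checks for both language codes before yielding a pair, so the same loop works for any language combination.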

For further information, please refer to the article cited below.

Citation:

@inproceedings{ahmia-etal-2018-two,
    title = "Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications.",
    author = "Ahmia, Oussama and B{\'e}chet, Nicolas and Marteau, Pierre-Fran{\c{c}}ois",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1583",
}
