iliaschalkidis / lmtc-eurlex57k

Large-Scale Multi-Label Text Classification on EU Legislation
Apache License 2.0

Missing data from downloading website #21

Open NoobVic opened 1 year ago

NoobVic commented 1 year ago

When I download the raw version from http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip, the train folder contains only 18,234 JSON files instead of 45,000. The dev folder is also missing.
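A quick way to confirm the problem after unpacking the archive is to count the JSON files in each split folder. The snippet below is a minimal sketch, not part of the repository: the function name `count_json_files`, the extraction path `datasets`, and the `test` folder name are assumptions, while `train` and `dev` are the folder names mentioned above.

```python
from pathlib import Path


def count_json_files(root, splits=("train", "dev", "test")):
    """Return {split: number of .json files}; None marks a missing split folder."""
    root = Path(root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if split_dir.is_dir():
            # Count only files ending in .json directly inside the split folder.
            counts[split] = sum(1 for _ in split_dir.glob("*.json"))
        else:
            counts[split] = None
    return counts


if __name__ == "__main__":
    # "datasets" is an assumed extraction path; adjust it to where the zip was unpacked.
    for split, n in count_json_files("datasets").items():
        print(split, "missing" if n is None else f"{n} json files")
```

With a complete download, the train folder should report 45,000 files; a lower count or a "missing" line points to a truncated or incomplete archive.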

iliaschalkidis commented 1 year ago

Hi @NoobVic,

You can find the EURLEX-57K dataset on HuggingFace (https://huggingface.co/datasets/eurlex). An updated multilingual version, MultiEURLEX (https://aclanthology.org/2021.emnlp-main.559/), which includes 65k documents, is also available (https://huggingface.co/datasets/nlpaueb/multi_eurlex).