huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Broken link to PubMed Abstracts dataset #6273

Open sameemqureshi opened 11 months ago

sameemqureshi commented 11 months ago

Describe the bug

The link provided for the dataset in the course is broken: data_files = https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst

Steps to reproduce the bug

Steps to reproduce:

1) Head over to https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt#big-data-datasets-to-the-rescue

2) In the section "What is the Pile?", the code snippet contains the broken link.
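
A quick way to confirm the report without running the course snippet is to request the URL and look at the response. The sketch below uses only the Python standard library. The `check_link` helper and its `opener` parameter are hypothetical names introduced here for illustration and are not part of `datasets`.

```python
import urllib.error
import urllib.request

# URL quoted in the bug report above.
PUBMED_URL = (
    "https://the-eye.eu/public/AI/pile_preliminary_components/"
    "PUBMED_title_abstracts_2019_baseline.jsonl.zst"
)


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request to `url` and return (ok, detail).

    `opener` is injectable (an assumption of this sketch) so the helper
    can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
            return 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        return False, f"unreachable: {err}"
```

Calling `check_link(PUBMED_URL)` from a machine with network access returns a pair such as `(False, "HTTP 404")` when the file is gone. The exact status depends on how the host responds at the time.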

Expected behavior

The link should resolve to the PubMed Abstracts dataset.

Environment info

.

mariosasko commented 11 months ago

This has already been reported in the HF Course repo (https://github.com/huggingface/course/issues/623).

mariosasko commented 11 months ago

@lhoestq @albertvillanova @lewtun I don't think we are allowed to host these data files on the Hub (due to DMCA), which means the only option is to use a different dataset in the course (and to re-record the video 🙂), no?

lhoestq commented 11 months ago

Keeping the video is probably fine: we can add a note on YouTube suggesting that viewers load a different dataset, maybe C4, and update the code snippets on the website.

qualis2006 commented 8 months ago

You may want to try the PubMed dataset that I reproduced, based on the PubMed Abstracts GitHub site, and uploaded to the Hugging Face Hub:

from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

#Downloading data: 100%
#7.98G/7.98G [11:47<00:00, 9.68MB/s]
#Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]

#DatasetDict({
#   train: Dataset({
#        features: ['meta', 'text'],
#        num_rows: 17722096
#    })
#})
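
Since the log above shows a download of about 7.98 GB, a lighter way to inspect the data is streaming mode. The sketch below is a hedged illustration: `preview_records` and `stream_pubmed` are hypothetical helper names, and it assumes the repository's files are in a format that `datasets` can stream. The `text` field name comes from the features listed above.

```python
from itertools import islice


def preview_records(records, n=3, max_chars=80):
    """Return the 'text' field of the first `n` records, cut to `max_chars` characters."""
    return [record["text"][:max_chars] for record in islice(records, n)]


def stream_pubmed():
    """Open the dataset lazily (needs the `datasets` package and network access)."""
    # Imported here so that preview_records can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "hwang2006/PUBMED_title_abstracts_2020_baseline",
        split="train",
        streaming=True,  # iterate over examples without downloading the full dataset first
    )
```

With network access, `preview_records(stream_pubmed())` fetches only the first few examples rather than the whole file.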

cuntoushifu commented 4 months ago

Kong Lingtao says thank you, thank you.