huggingface / course

The Hugging Face course on Transformers
https://huggingface.co/course
Apache License 2.0

Broken Link to PubMed Abstracts dataset #623

Open yacinebouaouni opened 1 year ago

yacinebouaouni commented 1 year ago

The link provided in Section 5, "Big data? 🤗 Datasets to the rescue!", is broken:

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

qualis2006 commented 10 months ago

Here is a Hugging Face repository I created for the PubMed abstracts dataset that you may want to look at:

from datasets import load_dataset
pubmed_dataset = load_dataset("qualis2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

Downloading data: 100% 7.98G/7.98G [11:47<00:00, 9.68MB/s]
Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]

DatasetDict({
    train: Dataset({
        features: ['meta', 'text'],
        num_rows: 17722096
    })
})

Mik-TF commented 8 months ago

@qualis2006 Nice! Thanks. On my end, your code works, but I then need to use pubmed_dataset['train'] instead of pubmed_dataset throughout the rest of the page.
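The difference comes from what load_dataset returns: without a split argument it gives back a DatasetDict keyed by split name, whereas the course page works with a single Dataset. A minimal sketch of one way to bridge the two follows. The helper names as_train_split and load_pubmed_train are hypothetical, made up for illustration, and are not part of the datasets library.

```python
def as_train_split(loaded):
    """Return the 'train' split if `loaded` is a dict of splits, else `loaded` unchanged.

    DatasetDict subclasses dict, so a plain isinstance check covers it.
    """
    if isinstance(loaded, dict) and "train" in loaded:
        return loaded["train"]
    return loaded


def load_pubmed_train():
    """Hypothetical wrapper: load the repository above and return only its train split."""
    # Imported lazily so the helper above can be used without `datasets` installed.
    from datasets import load_dataset

    # Note: the first call downloads several GB (see the progress bar above).
    loaded = load_dataset("qualis2006/PUBMED_title_abstracts_2020_baseline")
    return as_train_split(loaded)
```

Passing split="train" directly to load_dataset should have the same effect, returning a Dataset rather than a DatasetDict, so the rest of the page can keep using pubmed_dataset unchanged.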

To run the code on the page as is, we can instead point data_files at the full URL of the file in that repository:

data_files="https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"

@yacinebouaouni this line should work.
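For anyone following along, here is a rough sketch of how that URL could slot into the loading call. It assumes the course page loads the file with the generic "json" builder and split="train", and that the zstandard package is installed so the .zst file can be decompressed. The helper names hub_file_url and load_pubmed_from_url are hypothetical and only illustrate how the URL is put together.

```python
HUB_DATASETS_BASE = "https://huggingface.co/datasets"


def hub_file_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for a file in a Hub dataset repository.

    Follows the pattern of the URL quoted above:
    <base>/<repo_id>/resolve/<revision>/<filename>
    """
    return f"{HUB_DATASETS_BASE}/{repo_id}/resolve/{revision}/{filename}"


def load_pubmed_from_url():
    """Hypothetical wrapper: load the JSON Lines file from its full URL."""
    # Imported lazily so the URL helper works without `datasets` installed.
    from datasets import load_dataset

    data_files = hub_file_url(
        "qualis2006/PUBMED_title_abstracts_2020_baseline",
        "PUBMED_title_abstracts_2020_baseline.jsonl.zst",
    )
    # Assumption: this mirrors the call used on the course page.
    # split="train" returns a Dataset directly instead of a DatasetDict.
    return load_dataset("json", data_files=data_files, split="train")
```

Because split="train" is passed, the result is a single Dataset, so no ['train'] indexing is needed afterwards.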