Open sameemqureshi opened 11 months ago
This has already been reported in the HF Course repo (https://github.com/huggingface/course/issues/623).
@lhoestq @albertvillanova @lewtun I don't think we are allowed to host these data files on the Hub (due to DMCA), which means the only option is to use a different dataset in the course (and to re-record the video 🙂), no?
Keeping the video is probably fine: we can add a note on YouTube suggesting that viewers load a dataset with a different name. Maybe C4? We could also update the code snippets on the website.
Maybe you want to try it with the PUBMED dataset that I reproduced from the PubMed Abstract GitHub site and uploaded to the Hugging Face Hub:
from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset
#Downloading data: 100%
#7.98G/7.98G [11:47<00:00, 9.68MB/s]
#Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]
#DatasetDict({
# train: Dataset({
# features: ['meta', 'text'],
# num_rows: 17722096
# })
#})
Tang Lingtao says: thank you, thank you.
Describe the bug
The link provided for the dataset is broken: data_files = https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst
Steps to reproduce the bug
Steps to reproduce:
1) Head over to https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt#big-data-datasets-to-the-rescue
2) In the section "What is the Pile?", there is a code snippet that contains the broken link.
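To confirm the broken link outside the course page, something like the following could be used. It is only a minimal sketch using the Python standard library: `link_status` and its `opener` parameter are hypothetical names introduced here for illustration, not part of `datasets` or the course code, and the URL is the one quoted in this issue.

```python
import urllib.error
import urllib.request

# URL quoted in this issue; assumed to be the one used in the course snippet.
PILE_PUBMED_URL = (
    "https://the-eye.eu/public/AI/pile_preliminary_components/"
    "PUBMED_title_abstracts_2019_baseline.jsonl.zst"
)


def link_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url`, or None if the host is unreachable.

    `opener` is injectable so the function can be exercised without network access.
    """
    # HEAD avoids downloading the (multi-gigabyte) file body.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None
```

Calling `link_status(PILE_PUBMED_URL)` returns the status code reported by the server. Anything other than a 2xx code, or `None`, would be consistent with the broken link described above.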
Expected behavior
The link should redirect to the PubMed Abstracts dataset, as expected.
Environment info
.