Open yacinebouaouni opened 1 year ago
Here is the Hugging Face repository I created for the PubMed abstracts dataset, which you may want to look at:
from datasets import load_dataset

pubmed_dataset = load_dataset("qualis2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

Downloading data: 100% 7.98G/7.98G [11:47<00:00, 9.68MB/s]
Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]

DatasetDict({
    train: Dataset({
        features: ['meta', 'text'],
        num_rows: 17722096
    })
})
@qualis2006 Nice, thanks! Your code works on my end, but it returns a DatasetDict, so I need to use pubmed_dataset['train'] instead of pubmed_dataset throughout the rest of the page.
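One way to avoid the extra ['train'] indexing might be to request the split directly. This is only a sketch: the helper name load_pubmed_train is hypothetical, and it assumes the repository exposes a single train split, as the output above suggests. The import is done inside the function so that defining it does not trigger the roughly 8 GB download.

```python
# Repository ID taken from the comment above.
REPO_ID = "qualis2006/PUBMED_title_abstracts_2020_baseline"


def load_pubmed_train(repo_id=REPO_ID):
    """Hypothetical helper: return the train split as a Dataset.

    Passing split="train" makes load_dataset return a Dataset
    rather than a DatasetDict keyed by split name.
    """
    # Imported lazily; requires `pip install datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train")
```

With this, pubmed_dataset = load_pubmed_train() should be usable in place of pubmed_dataset['train'] in the rest of the page.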
To run the code on the page as is, we can instead pass the full URL of the file as data_files:
data_files="https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"
@yacinebouaouni this line should work.
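For reference, here is a minimal sketch of how that data_files value could be plugged into a load_dataset call with the generic "json" builder. The names PUBMED_URL and load_pubmed_from_url are hypothetical, and the sketch assumes the zstandard package is installed so that the .zst file can be decompressed. The streaming flag is optional and avoids downloading the whole file up front.

```python
# URL taken from the comment above, split across two lines for readability.
PUBMED_URL = (
    "https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline"
    "/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"
)


def load_pubmed_from_url(data_files=PUBMED_URL, streaming=False):
    """Hypothetical helper: load the JSON Lines file from a direct URL.

    With streaming=True the examples are read lazily as an
    IterableDataset instead of being downloaded and cached first.
    """
    # Imported lazily; requires `pip install datasets zstandard`.
    from datasets import load_dataset

    return load_dataset(
        "json",
        data_files=data_files,
        split="train",
        streaming=streaming,
    )
```

Calling load_pubmed_from_url() returns the train split directly, so no ['train'] indexing should be needed afterwards.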
The link provided in Section 5 / "Big data? 🤗 Datasets to the rescue!" is broken:
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"