Open hrh-bbc-rd opened 1 year ago
I have been able to continue the course by using this link instead:
data_files = "https://the-eye.eu/public/AI/pile_v2/data/NIH_ExPORTER_awarded_grant_text.jsonl.zst"
It looks like this URL has changed and broken the link before (see #324).
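Since the hosted URL seems to move from time to time, one option is to check a list of candidate URLs before handing one to `load_dataset`. The following is only a minimal sketch using the standard library: the helper names `url_is_reachable` and `first_reachable` are hypothetical and not part of the `datasets` library, and a successful HEAD request does not guarantee that the file itself is still intact.

```python
from typing import Callable, Iterable, Optional
import urllib.request


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds (hypothetical helper)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False


def first_reachable(
    urls: Iterable[str],
    is_reachable: Callable[[str], bool] = url_is_reachable,
) -> Optional[str]:
    """Return the first URL that passes `is_reachable`, or None if none do."""
    for url in urls:
        if is_reachable(url):
            return url
    return None
```

For example, `first_reachable(["https://the-eye.eu/public/AI/pile_v2/data/NIH_ExPORTER_awarded_grant_text.jsonl.zst"])` returns that URL if the server still answers, or `None` otherwise. A non-`None` result can then be passed as `data_files` in the usual `load_dataset("json", data_files=..., split="train")` call.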
Note that there is another broken link further down the page, in the following code block:
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))
Same issue here. It looks like the Pile has been taken down for copyright reasons.
The link to the PubMed Abstracts dataset is broken in Chapter 5, Section 4 (the 'Big Datasets' chapter).
The broken link in question is found in this line:
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
Chapter here