Hi @ToddMorrill, thanks for reporting.
Three weeks ago I contacted the team who created the Pile dataset to report this issue with their data hosting server: https://the-eye.eu
They told me that, unfortunately, the-eye was heavily affected by the recent tornado catastrophe in the US. They hope to have their data back online as soon as possible.
Hi @ToddMorrill, the Pile team has mirrored their data on a new host server: https://mystic.the-eye.eu
It should work if you update your URL.
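For example, the loading call from the course only needs the host swapped to the mirror (a sketch; the path on mystic.the-eye.eu is assumed to mirror the original layout on the-eye.eu):
from datasets import load_dataset

# same path as before, only the host changed to the mirror
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")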
We should also update the URL in our course material.
The old URL is still present in the HuggingFace course here: https://huggingface.co/course/chapter5/4?fw=pt
I have created a PR for the notebook here: https://github.com/huggingface/notebooks/pull/148
Not sure if the HTML is in a public repo; I wasn't able to find it.
Fixed the other two URLs here: https://github.com/mwunderlich/notebooks/pull/1
Both URLs are broken now
HTTPError: 404 Client Error: Not Found for URL: https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst
And
ConnectTimeout: HTTPSConnectionPool(host='mystic.the-eye.eu', port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by ConnectTimeoutError(, 'Connection to mystic.the-eye.eu timed out. (connect timeout=10.0)'))
I was able to find a torrent with "The Pile" dataset here: "The Pile: An 800GB Dataset of Diverse Text for Language Modeling".
The complete dataset is huge, so I would suggest downloading only the "PUBMED_title_abstracts_2019_baseline.jsonl.zst" file, which is about 7 GB. You can do this with a torrent client of your choice (I typically use Transmission, which comes pre-installed on Ubuntu distributions).
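Once the file has been downloaded via the torrent, it can be loaded locally the same way as the remote file (a sketch; the local path is hypothetical, and the zstandard package is needed for .zst support):
from datasets import load_dataset

# hypothetical path where the torrent client saved the file
data_files = "/path/to/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# datasets reads zstd-compressed JSON Lines directly (requires the zstandard package)
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")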
@albertvillanova another issue:
  experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 474, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
    column_names = next(iter(dataset)).keys()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
    for key, example in ex_iterable:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 207, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/lfs/ampere1/0/brando9/.cache/huggingface/modules/datasets_modules/datasets/EleutherAI--pile/ebea56d358e91cf4d37b0fde361d563bed1472fbd8221a21b38fc8bb4ba554fb/pile.py", line 236, in _generate_examples
    with zstd.open(open(files[subset], "rb"), "rt", encoding="utf-8") as f:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/streaming.py", line 74, in wrapper
    return function(*args, download_config=download_config, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst
Any suggestions?
This seems to work, but it's rather annoying.
Summary of how to make it work:
load_dataset('parquet', data_files=urls)
(Note: the HF API names are really confusing sometimes.) Pseudocode:
urls_hacker_news = [
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00001-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00002-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00003-of-00004.parquet"
]
...
# streaming = False
import datetime
import wandb
# note: mode, num_batches, probabilities, data_mixture_name, streaming and data_files_prefix
# are assumed to be defined earlier in the full script (this is pseudocode)
from diversity.pile_subset_urls import urls_hacker_news
path, name, data_files = 'parquet', 'hacker_news', urls_hacker_news
# not changing
batch_size = 512
today = datetime.datetime.now().strftime('%Y-m%m-d%d-t%Hh_%Mm_%Ss')
run_name = f'{path} div_coeff_{num_batches=} ({today=} ({name=}) {data_mixture_name=} {probabilities=})'
print(f'{run_name=}')
# - Init wandb
debug: bool = mode == 'dryrun'
run = wandb.init(mode=mode, project="beyond-scale", name=run_name, save_code=True)
wandb.config.update({"num_batches": num_batches, "path": path, "name": name, "today": today, 'probabilities': probabilities, 'batch_size': batch_size, 'debug': debug, 'data_mixture_name': data_mixture_name, 'streaming': streaming, 'data_files': data_files})
# run.notify_on_failure() # https://community.wandb.ai/t/how-do-i-set-the-wandb-alert-programatically-for-my-current-run/4891
print(f'{debug=}')
print(f'{wandb.config=}')
# -- Get probe network
from datasets import load_dataset, interleave_datasets
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
tokenizer.pad_token = tokenizer.eos_token
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
probe_network = probe_network.to(device)
# -- Get data set
def my_load_dataset(path, name):
    print(f'{path=} {name=} {streaming=}')
    if path == 'json' or path == 'bin' or path == 'csv':
        print(f'{data_files_prefix+name=}')
        return load_dataset(path, data_files=data_files_prefix+name, streaming=streaming, split="train").with_format("torch")
    elif path == 'parquet':
        print(f'{data_files=}')
        return load_dataset(path, data_files=data_files, streaming=streaming, split="train").with_format("torch")
    else:
        return load_dataset(path, name, streaming=streaming, split="train").with_format("torch")
# - get data set for real now
if isinstance(path, str):
    dataset = my_load_dataset(path, name)
else:
    print('-- interleaving datasets')
    datasets = [my_load_dataset(path, name).with_format("torch") for path, name in zip(path, name)]
    [print(f'{dataset.description=}') for dataset in datasets]
    dataset = interleave_datasets(datasets, probabilities)
print(f'{dataset=}')
batch = dataset.take(batch_size)
print(f'{next(iter(batch))=}')
column_names = next(iter(batch)).keys()
print(f'{column_names=}')
# - Prepare functions to tokenize batch
def preprocess(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
def map(batch):
    return batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = map(batch)
print(f'{next(iter(tokenized_batch))=}')
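For reference, a minimal self-contained version of the parquet workaround, stripped of the experiment-specific code above (a sketch; only the first hacker_news shard is listed, add the rest as above):
from datasets import load_dataset

urls_hacker_news = [
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
    # ... remaining shards as listed above
]

# streaming=True avoids downloading the full shards up front
dataset = load_dataset('parquet', data_files=urls_hacker_news, split='train', streaming=True)
print(next(iter(dataset)))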
If some people stumble upon this thread and still have this problem, I re-uploaded the dataset to HF here.
It's the exact same dataset; you just have to change the URL from the course, for example:
from datasets import load_dataset, DownloadConfig
data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset(
"json",
data_files=data_files,
split="train",
download_config=DownloadConfig(delete_extracted=True), # optional argument
)
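If you don't want to download the whole ~7 GB file, the same URL can also be streamed (a sketch, following the streaming approach used in the course chapter; requires the zstandard package):
from datasets import load_dataset

data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# stream examples without downloading and extracting the whole file first
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
print(next(iter(pubmed_dataset_streamed)))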
Describe the bug
I am unable to download the PubMed dataset from the link provided in the Hugging Face Course (Chapter 5 Section 4).
https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst
Steps to reproduce the bug
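A sketch of the failing call, based on the loading code from the course (it uses the same URL that wget also fails on):
from datasets import load_dataset

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# raises HTTPError: 404 Client Error because the file is no longer hosted at the-eye.eu
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")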
I also tried with wget as follows.
Expected results
I expect to be able to download this file.
Actual results
Traceback
Environment info
datasets version: 1.17.0