huggingface / datasets


NIH exporter file not found #6144

Open · brando90 opened this issue 1 year ago

brando90 commented 1 year ago

Describe the bug

Can't use or download the NIH ExPORTER subset of the Pile dataset.

    experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 474, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
    column_names = next(iter(dataset)).keys()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
    for key, example in ex_iterable:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 207, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/lfs/ampere1/0/brando9/.cache/huggingface/modules/datasets_modules/datasets/EleutherAI--pile/ebea56d358e91cf4d37b0fde361d563bed1472fbd8221a21b38fc8bb4ba554fb/pile.py", line 236, in _generate_examples
    with zstd.open(open(files[subset], "rb"), "rt", encoding="utf-8") as f:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/streaming.py", line 74, in wrapper
    return function(*args, download_config=download_config, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst

Steps to reproduce the bug

Run this:

from datasets import load_dataset
path, name = 'EleutherAI/pile', 'nih_exporter'

# -- Get data set
dataset = load_dataset(path, name, streaming=True, split="train").with_format("torch")
batch = dataset.take(512)
print(f'{batch=}')

Expected behavior

The script should print the batch.

Environment info

(beyond_scale) brando9@ampere1:~/beyond-scale-language-data-diversity$ datasets-cli env

Copy-and-paste the text below in your GitHub issue.

- `datasets` version: 2.14.4
- Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 2.0.3
brando90 commented 1 year ago

related: https://github.com/huggingface/datasets/issues/3504

brando90 commented 1 year ago

Another file-not-found error, this time for the uspto subset:

Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 417, in _info
    await _file_info(
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 837, in _file_info
    r.raise_for_status()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://the-eye.eu/public/AI/pile_preliminary_components/pile_uspto.tar')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/lfs/ampere1/0/brando9/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 526, in <module>
    experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 475, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
    column_names = next(iter(dataset)).keys()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
    for key, example in ex_iterable:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 207, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/lfs/ampere1/0/brando9/.cache/huggingface/modules/datasets_modules/datasets/EleutherAI--pile/ebea56d358e91cf4d37b0fde361d563bed1472fbd8221a21b38fc8bb4ba554fb/pile.py", line 257, in _generate_examples
    for path, file in files[subset]:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 840, in __iter__
    yield from self.generator(*self.args, **self.kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 891, in _iter_from_urlpath
    with xopen(urlpath, "rb", download_config=download_config) as f:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/pile_uspto.tar
brando90 commented 1 year ago

FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/pile_uspto.tar

This seems to be the most relevant line.
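
For what it's worth, the 404 can be confirmed with a quick stdlib-only HEAD request (a minimal sketch; the URL is the one from the traceback):

import urllib.request, urllib.error

url = "https://the-eye.eu/public/AI/pile_preliminary_components/pile_uspto.tar"
try:
    # HEAD request: only the status code matters, not the body.
    with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")) as resp:
        print(resp.status)
except urllib.error.HTTPError as e:
    print(e.code)  # expect 404, matching the ClientResponseError above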

brando90 commented 1 year ago

Link to a tweet about this issue: https://twitter.com/BrandoHablando/status/1690081313519489024?s=20

brando90 commented 1 year ago

Related Stack Overflow question: https://stackoverflow.com/questions/76891189/how-to-download-data-from-hugging-face-that-is-visible-on-the-data-viewer-but-th

brando90 commented 1 year ago

This seems to work, but it's rather annoying.

Summary of how to make it work:

  1. Collect the URLs of the dataset's parquet files into a list.
  2. Pass that list to load_dataset via load_dataset('parquet', data_files=urls). (Note: the HF API naming is confusing here; the first argument is the generic 'parquet' builder, not the dataset path.)
  3. It should then work and print a batch of text; see the minimal sketch after this list.
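
A minimal, self-contained sketch of the workaround (assuming the parquet shards on the Hub's refs/convert/parquet branch, listed in the pseudocode below, are still reachable):

from datasets import load_dataset

# Parquet shards produced by the Hub's dataset viewer (refs/convert/parquet branch).
urls = [
    f"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-0000{i}-of-00004.parquet"
    for i in range(4)
]
# 'parquet' is the generic builder name; the actual data comes from data_files.
dataset = load_dataset("parquet", data_files=urls, streaming=True, split="train")
print(next(iter(dataset)))  # one example; the Pile subsets have a "text" field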

Pseudocode (an excerpt from my script; names like num_batches, mode, streaming, probabilities, data_mixture_name, and data_files_prefix are defined elsewhere):

urls_hacker_news = [
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00001-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00002-of-00004.parquet",
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00003-of-00004.parquet"
]

...

    # streaming = False
    from diversity.pile_subset_urls import urls_hacker_news
    path, name, data_files = 'parquet', 'hacker_news', urls_hacker_news
    # not changing
    batch_size = 512
    today = datetime.datetime.now().strftime('%Y-m%m-d%d-t%Hh_%Mm_%Ss')
    run_name = f'{path} div_coeff_{num_batches=} ({today=} ({name=}) {data_mixture_name=} {probabilities=})'
    print(f'{run_name=}')

    # - Init wandb
    debug: bool = mode == 'dryrun'
    run = wandb.init(mode=mode, project="beyond-scale", name=run_name, save_code=True)
    wandb.config.update({"num_batches": num_batches, "path": path, "name": name, "today": today, 'probabilities': probabilities, 'batch_size': batch_size, 'debug': debug, 'data_mixture_name': data_mixture_name, 'streaming': streaming, 'data_files': data_files})
    # run.notify_on_failure() # https://community.wandb.ai/t/how-do-i-set-the-wandb-alert-programatically-for-my-current-run/4891
    print(f'{debug=}')
    print(f'{wandb.config=}')

    # -- Get probe network
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    # -- Get data set
    def my_load_dataset(path, name):
        print(f'{path=} {name=} {streaming=}')
        if path == 'json' or path == 'bin' or path == 'csv':
            print(f'{data_files_prefix+name=}')
            return load_dataset(path, data_files=data_files_prefix+name, streaming=streaming, split="train").with_format("torch")
        elif path == 'parquet':
            print(f'{data_files=}')
            return load_dataset(path, data_files=data_files, streaming=streaming, split="train").with_format("torch")
        else:
            return load_dataset(path, name, streaming=streaming, split="train").with_format("torch")
    # - get data set for real now
    if isinstance(path, str):
        dataset = my_load_dataset(path, name)
    else:
        print('-- interleaving datasets')
        datasets = [my_load_dataset(p, n) for p, n in zip(path, name)]  # with_format("torch") is already applied inside my_load_dataset
        [print(f'{dataset.description=}') for dataset in datasets]
        dataset = interleave_datasets(datasets, probabilities)
    print(f'{dataset=}')
    batch = dataset.take(batch_size)
    print(f'{next(iter(batch))=}')
    column_names = next(iter(batch)).keys()
    print(f'{column_names=}')

    # - Prepare functions to tokenize batch
    def preprocess(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
    remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    def tokenize_batch(batch):  # renamed from `map` to avoid shadowing the builtin
        return batch.map(preprocess, batched=True, remove_columns=remove_columns)
    tokenized_batch = tokenize_batch(batch)
    print(f'{next(iter(tokenized_batch))=}')
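
A hedged aside: instead of hard-coding the shard URLs, they can be enumerated from the same branch with huggingface_hub's HfApi.list_repo_files (a sketch, assuming the refs/convert/parquet branch exists for this dataset, as the URLs above suggest):

from huggingface_hub import HfApi

# List all files on the viewer's parquet-conversion branch of the dataset repo.
files = HfApi().list_repo_files(
    "EleutherAI/pile", repo_type="dataset", revision="refs/convert/parquet"
)
# Keep the hacker_news shards and turn them into resolvable download URLs.
urls_hacker_news = [
    f"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/{f}"
    for f in files
    if f.startswith("hacker_news/") and f.endswith(".parquet")
]
print(urls_hacker_news)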

See also these answers on Stack Overflow and the HF forum:

https://stackoverflow.com/questions/76891189/how-to-download-data-from-hugging-face-that-is-visible-on-the-data-viewer-but-th/76902681#76902681

https://discuss.huggingface.co/t/how-to-download-data-from-hugging-face-that-is-visible-on-the-data-viewer-but-the-files-are-not-available/50555/5?u=severo