Hi @ToddMorrill, thanks for reporting.
Three weeks ago I contacted the team who created the Pile dataset to report this issue with their data hosting server: https://the-eye.eu
They told me that, unfortunately, the-eye was heavily affected by the recent tornado catastrophe in the US. They hope to have their data back online as soon as possible.
Hi @ToddMorrill, the Pile team has mirrored their data on a new host server: https://mystic.the-eye.eu
It should work if you update your URL.
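For example, the loading call from the course only needs the host swapped to the mirror (a sketch; the path on mystic.the-eye.eu is assumed to mirror the original layout on the-eye.eu):
from datasets import load_dataset

# same path as before, only the host changed to the mirror
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")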
We should also update the URL in our course material.
The old URL is still present in the HuggingFace course here: https://huggingface.co/course/chapter5/4?fw=pt
I have created a PR for the notebook here: https://github.com/huggingface/notebooks/pull/148
Not sure if the HTML is in a public repo; I wasn't able to find it.
Fixed the other two URLs here: https://github.com/mwunderlich/notebooks/pull/1
Both URLs are broken now
HTTPError: 404 Client Error: Not Found for URL: https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst
And
ConnectTimeout: HTTPSConnectionPool(host='mystic.the-eye.eu', port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by ConnectTimeoutError(, 'Connection to mystic.the-eye.eu timed out. (connect timeout=10.0)'))
I was able to find a torrent with "The Pile" dataset here: "The Pile: An 800GB Dataset of Diverse Text for Language Modeling".
The complete dataset is huge, so I would suggest downloading only the "PUBMED_title_abstracts_2019_baseline.jsonl.zst" file, which is about 7 GB. You can do this with a torrent client of your choice (I typically use Transmission, which comes pre-installed on Ubuntu distributions).
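Once the file has been downloaded via the torrent, it can be loaded locally the same way as the remote file (a sketch; the local path is hypothetical, and the zstandard package is needed for .zst support):
from datasets import load_dataset

# hypothetical path where the torrent client saved the file
data_files = "/path/to/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# datasets reads zstd-compressed JSON Lines directly (requires the zstandard package)
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")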
@albertvillanova another issue:
  experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 474, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
    column_names = next(iter(dataset)).keys()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1353, in __iter__
    for key, example in ex_iterable:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 207, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/lfs/ampere1/0/brando9/.cache/huggingface/modules/datasets_modules/datasets/EleutherAI--pile/ebea56d358e91cf4d37b0fde361d563bed1472fbd8221a21b38fc8bb4ba554fb/pile.py", line 236, in _generate_examples
    with zstd.open(open(files[subset], "rb"), "rt", encoding="utf-8") as f:
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/streaming.py", line 74, in wrapper
    return function(*args, download_config=download_config, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/site-packages/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst
Any suggestions?
This seems to work, but it's rather annoying.
Summary of how to make it work:
load_dataset('parquet', data_files=urls)
(Note: the HF API names are really confusing sometimes.) Pseudocode:
urls_hacker_news = [
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00001-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00002-of-00004.parquet",
"https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00003-of-00004.parquet"
]
...
# streaming = False
import datetime
import wandb
# note: mode, num_batches, probabilities, data_mixture_name, streaming and data_files_prefix
# are assumed to be defined earlier in the full script (this is pseudocode)
from diversity.pile_subset_urls import urls_hacker_news
path, name, data_files = 'parquet', 'hacker_news', urls_hacker_news
# not changing
batch_size = 512
today = datetime.datetime.now().strftime('%Y-m%m-d%d-t%Hh_%Mm_%Ss')
run_name = f'{path} div_coeff_{num_batches=} ({today=} ({name=}) {data_mixture_name=} {probabilities=})'
print(f'{run_name=}')
# - Init wandb
debug: bool = mode == 'dryrun'
run = wandb.init(mode=mode, project="beyond-scale", name=run_name, save_code=True)
wandb.config.update({"num_batches": num_batches, "path": path, "name": name, "today": today, 'probabilities': probabilities, 'batch_size': batch_size, 'debug': debug, 'data_mixture_name': data_mixture_name, 'streaming': streaming, 'data_files': data_files})
# run.notify_on_failure() # https://community.wandb.ai/t/how-do-i-set-the-wandb-alert-programatically-for-my-current-run/4891
print(f'{debug=}')
print(f'{wandb.config=}')
# -- Get probe network
from datasets import load_dataset, interleave_datasets
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token_id is None:
tokenizer.pad_token = tokenizer.eos_token
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
probe_network = probe_network.to(device)
# -- Get data set
def my_load_dataset(path, name):
    print(f'{path=} {name=} {streaming=}')
    if path == 'json' or path == 'bin' or path == 'csv':
        print(f'{data_files_prefix+name=}')
        return load_dataset(path, data_files=data_files_prefix+name, streaming=streaming, split="train").with_format("torch")
    elif path == 'parquet':
        print(f'{data_files=}')
        return load_dataset(path, data_files=data_files, streaming=streaming, split="train").with_format("torch")
    else:
        return load_dataset(path, name, streaming=streaming, split="train").with_format("torch")
# - get data set for real now
if isinstance(path, str):
    dataset = my_load_dataset(path, name)
else:
    print('-- interleaving datasets')
    datasets = [my_load_dataset(path, name).with_format("torch") for path, name in zip(path, name)]
    [print(f'{dataset.description=}') for dataset in datasets]
    dataset = interleave_datasets(datasets, probabilities)
print(f'{dataset=}')
batch = dataset.take(batch_size)
print(f'{next(iter(batch))=}')
column_names = next(iter(batch)).keys()
print(f'{column_names=}')
# - Prepare functions to tokenize batch
def preprocess(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True, return_tensors="pt")
remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
def map(batch):
    return batch.map(preprocess, batched=True, remove_columns=remove_columns)
tokenized_batch = map(batch)
print(f'{next(iter(tokenized_batch))=}')
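For reference, a minimal self-contained version of the parquet workaround, stripped of the experiment-specific code above (a sketch; only the first hacker_news shard is listed, add the rest as above):
from datasets import load_dataset

urls_hacker_news = [
    "https://huggingface.co/datasets/EleutherAI/pile/resolve/refs%2Fconvert%2Fparquet/hacker_news/pile-train-00000-of-00004.parquet",
    # ... remaining shards as listed above
]

# streaming=True avoids downloading the full shards up front
dataset = load_dataset('parquet', data_files=urls_hacker_news, split='train', streaming=True)
print(next(iter(dataset)))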
If some people stumble upon this thread and still have this problem, I re-uploaded the dataset to HF here.
It's the exact same dataset; you just have to change the URL from the course, for example:
from datasets import load_dataset, DownloadConfig
data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset(
"json",
data_files=data_files,
split="train",
download_config=DownloadConfig(delete_extracted=True), # optional argument
)
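If you don't want to download the whole ~7 GB file, the same URL can also be streamed (a sketch, following the streaming approach used in the course chapter; requires the zstandard package):
from datasets import load_dataset

data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# stream examples without downloading and extracting the whole file first
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
print(next(iter(pubmed_dataset_streamed)))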
Describe the bug
I am unable to download the PubMed dataset from the link provided in the Hugging Face Course (Chapter 5 Section 4).
https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst
Steps to reproduce the bug
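A sketch of the failing call, based on the loading code from the course (it uses the same URL that wget also fails on):
from datasets import load_dataset

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

# raises HTTPError: 404 Client Error because the file is no longer hosted at the-eye.eu
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")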
I also tried with wget as follows.
Expected results
I expect to be able to download this file.
Actual results
Traceback
Environment info
datasets version: 1.17.0