huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.24k stars 2.69k forks source link

When len(_URLS) > 16, download will hang #5270

Open Freed-Wu opened 1 year ago

Freed-Wu commented 1 year ago

Describe the bug

In [9]: dataset = load_dataset('Freed-Wu/kodak', split='test')
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.53k/2.53k [00:00<00:00, 1.88MB/s]
[11/19/22 22:16:21] WARNING  Using custom data configuration default                                                                                                                                         builder.py:379
Downloading and preparing dataset kodak/default to /home/wzy/.cache/huggingface/datasets/Freed-Wu___kodak/default/0.0.1/bd1cc3434212e3e654f7e16ad618f8a1470b5982b086c91b1d6bc7187183c6e9...
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 531k/531k [00:02<00:00, 239kB/s]
#10: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.06s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 534k/534k [00:02<00:00, 193kB/s]
#14: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.37s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 692k/692k [00:02<00:00, 269kB/s]
#12: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.44s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 566k/566k [00:02<00:00, 210kB/s]
#5: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.53s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 613k/613k [00:02<00:00, 235kB/s]
#13: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.53s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 786k/786k [00:02<00:00, 342kB/s]
#3: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.60s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 619k/619k [00:02<00:00, 254kB/s]
#4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.68s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 737k/737k [00:02<00:00, 271kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 788k/788k [00:02<00:00, 285kB/s]
#6: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.04s/obj]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 618k/618k [00:04<00:00, 153kB/s]
#0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.69s/obj]
^CProcess ForkPoolWorker-47:
Process ForkPoolWorker-46:
Process ForkPoolWorker-36:
Process ForkPoolWorker-38:██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.04s/obj]
Process ForkPoolWorker-37:
Process ForkPoolWorker-45:
Process ForkPoolWorker-39:
Process ForkPoolWorker-43:
Process ForkPoolWorker-33:
Process ForkPoolWorker-18:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 365, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
KeyboardInterrupt
KeyboardInterrupt
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process ForkPoolWorker-20:
Process ForkPoolWorker-44:
Process ForkPoolWorker-22:
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 197, in _single_map_nested
    return function(data_struct)
  File "/usr/lib/python3.10/site-packages/datasets/utils/download_manager.py", line 217, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 561, in get_from_cache
    response = http_head(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 476, in http_head
    response = _request_with_retry(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 405, in _request_with_retry
    response = requests.request(method=method.upper(), url=url, timeout=timeout, **params)
  File "/usr/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
KeyboardInterrupt
#1:   0%|                                                                                                                                                                                           | 0/2 [03:00<?, ?obj/s]
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 197, in _single_map_nested
    return function(data_struct)
  File "/usr/lib/python3.10/site-packages/datasets/utils/download_manager.py", line 217, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 659, in get_from_cache
    http_get(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 442, in http_get
    response = _request_with_retry(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 405, in _request_with_retry
    response = requests.request(method=method.upper(), url=url, timeout=timeout, **params)
  File "/usr/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 197, in _single_map_nested
    return function(data_struct)
  File "/usr/lib/python3.10/site-packages/datasets/utils/download_manager.py", line 217, in _download
    return cached_path(url_or_filename, download_config=download_config)
KeyboardInterrupt
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 561, in get_from_cache
    response = http_head(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 476, in http_head
    response = _request_with_retry(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 405, in _request_with_retry
    response = requests.request(method=method.upper(), url=url, timeout=timeout, **params)
  File "/usr/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
KeyboardInterrupt
#3:   0%|                                                                                                                                                                                           | 0/2 [03:00<?, ?obj/s]
#11:   0%|                                                                                                                                                                                          | 0/1 [00:49<?, ?obj/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 197, in _single_map_nested
    return function(data_struct)
  File "/usr/lib/python3.10/site-packages/datasets/utils/download_manager.py", line 217, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 561, in get_from_cache
    response = http_head(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 476, in http_head
    response = _request_with_retry(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 405, in _request_with_retry
    response = requests.request(method=method.upper(), url=url, timeout=timeout, **params)
  File "/usr/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 723, in send
    history = [resp for resp in gen]
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 723, in <listcomp>
    history = [resp for resp in gen]
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 266, in resolve_redirects
    resp = self.send(
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
KeyboardInterrupt
#5:   0%|                                                                                                                                                                                           | 0/1 [03:00<?, ?obj/s]
KeyboardInterrupt

Process ForkPoolWorker-42:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 215, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True)) for v in pbar]
  File "/usr/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 197, in _single_map_nested
    return function(data_struct)
  File "/usr/lib/python3.10/site-packages/datasets/utils/download_manager.py", line 217, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 561, in get_from_cache
    response = http_head(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 476, in http_head
    response = _request_with_retry(
  File "/usr/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 405, in _request_with_retry
    response = requests.request(method=method.upper(), url=url, timeout=timeout, **params)
  File "/usr/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
KeyboardInterrupt
#9:   0%|                                                                                                                                                                                           | 0/1 [00:51<?, ?obj/s]

Steps to reproduce the bug

"""Kodak.

Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import datasets

NUMBER = 17
_DESCRIPTION = """\
The pictures below link to lossless, true color (24 bits per pixel, aka "full
color") images. It is my understanding they have been released by the Eastman
Kodak Company for unrestricted usage. Many sites use them as a standard test
suite for compression testing, etc. Prior to this site, they were only
available in the Sun Raster format via ftp. This meant that the images could
not be previewed before downloading. Since their release, however, the lossless
PNG format has been incorporated into all the major browsers. Since PNG
supports 24-bit lossless color (which GIF and JPEG do not), it is now possible
to offer this browser-friendly access to the images.
"""
_HOMEPAGE = "https://r0k.us/graphics/kodak/"
_LICENSE = "GPLv3"
_URLS = [
    f"https://github.com/MohamedBakrAli/Kodak-Lossless-True-Color-Image-Suite/raw/master/PhotoCD_PCD0992/{i}.png"
    for i in range(1, 1 + NUMBER)
]

class Kodak(datasets.GeneratorBasedBuilder):
    """Kodak datasets."""

    VERSION = datasets.Version("0.0.1")

    def _info(self):
        features = datasets.Features(
            {
                "image": datasets.Image(),
            }
        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
        )

    def _split_generators(self, dl_manager):
        """Return SplitGenerators."""
        file_paths = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "file_paths": file_paths,
                },
            ),
        ]

    def _generate_examples(self, file_paths):
        """Yield examples."""
        for file_path in file_paths:
            yield file_path, {"image": file_path}

Expected behavior

When len(_URLS) < 16, it works.

In [3]: dataset = load_dataset('Freed-Wu/kodak', split='test')
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.53k/2.53k [00:00<00:00, 3.02MB/s]
[11/19/22 22:04:28] WARNING  Using custom data configuration default                                                                                                                                         builder.py:379
Downloading and preparing dataset kodak/default to /home/wzy/.cache/huggingface/datasets/Freed-Wu___kodak/default/0.0.1/d26017602a592b5bfa7e008127cdf9dec5af220c9068005f1b4eda036031f475...
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 593k/593k [00:00<00:00, 2.88MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 621k/621k [00:03<00:00, 166kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 531k/531k [00:01<00:00, 366kB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:13<00:00,  1.18it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 3832.38it/s]
Dataset kodak downloaded and prepared to /home/wzy/.cache/huggingface/datasets/Freed-Wu___kodak/default/0.0.1/d26017602a592b5bfa7e008127cdf9dec5af220c9068005f1b4eda036031f475. Subsequent calls will reuse this data.

Environment info

Freed-Wu commented 1 year ago

It can fix the bug temporarily.

from datasets import DownloadConfig
config = DownloadConfig(num_proc=8)
In [5]: dataset = load_dataset('Freed-Wu/kodak', split='test', download_config=config)
Downloading and preparing dataset kodak/default to /home/wzy/.cache/huggingface/datasets/Freed-Wu___kodak/default/0.0.1/6cf51f2b3d686d24a33fe86945f9e16802def212325f9345cf3cbb1b9f5f4a57...
Downloading data files #4: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.39obj/s]
Downloading data files #2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.38obj/s]
Downloading data files #3: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.13obj/s]
Downloading data files #7: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.09obj/s]
Downloading data files #5: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.08obj/s]
Downloading data files #0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.08obj/s]
Downloading data files #1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.36s/obj]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 492k/492k [00:01<00:00, 253kB/s]
Downloading data files #6: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00,  4.63s/obj]
Extracting data files #0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1407.17obj/s]
Extracting data files #1: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1325.91obj/s]
Extracting data files #3: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1524.46obj/s]
Extracting data files #2: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1404.66obj/s]
Extracting data files #4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1538.63obj/s]
Extracting data files #6: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1711.73obj/s]
Extracting data files #7: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2144.33obj/s]
Extracting data files #5: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1964.85obj/s]
Dataset kodak downloaded and prepared to /home/wzy/.cache/huggingface/datasets/Freed-Wu___kodak/default/0.0.1/6cf51f2b3d686d24a33fe86945f9e16802def212325f9345cf3cbb1b9f5f4a57. Subsequent calls will reuse this data.
lhoestq commented 1 year ago

Thanks for reporting ! This sounds like an issue with python multiprocessing. If we switch to multithreading for the downloads it should be much more robust - let me know if this is something you'd like to contribute, I'd be happy to help and give you some pointers

Freed-Wu commented 1 year ago

an issue with python multiprocessing

If it is an issue with multiprocessing, should we report it to upstream?

lhoestq commented 1 year ago

Debugging this would require quite some work in my opinion, and I've often failed to make reproducible examples, since it's pretty correlated to one's environment + hardware. So I wouldn't spend too much time on this unless we manage to reproduce this on another machine consistently.

Instead I'd encourage a more pragmatic fix that is: not create tons of processes (on regular machines it may slow things down anyway), and instead use multithreading by default.

Freed-Wu commented 1 year ago

I am not expert of python. I hear about python has GIL, which result in multi processing is worse than multi threading. So I am not sure if this change makes sense?

And if this is a bug of multi processing, why not report to upstream and let them fix? And even if change it to multi threading, how can we make sure it can truly fix this problem?

Freed-Wu commented 1 year ago

Just my 2c. No offense.

lhoestq commented 1 year ago

Just my 2c. No offense.

sure np ^^

I hear about python has GIL, which result in multi processing is worse than multi threading. So I am not sure if this change makes sense?

Here the bottleneck speed is the bandwidth used to download the files. When downloading, the GIL is released, so multithreading gives the same speed as multiprocessing.

And if this is a bug of multi processing, why not report to upstream and let them fix?

Usually to fix a bug it's important to be able to reproduce it. This way you can share it, experiment with it, and then make sure it's fixed. Here I'm afraid it's not easy to reproduce. Though I think that spawning too many processes for your machine can lead to this kind of issues.

And even if change it to multi threading, how can we make sure it can truly fix this problem?

Multithreading is more robust in python because IIRC there are less locks involved which are often the cause of code hanging for no reason.