huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

(Load dataset failure) ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py #759

Closed AI678 closed 3 years ago

AI678 commented 3 years ago

Hey, I want to load the cnn_dailymail dataset for fine-tuning. I wrote the code like this:

from datasets import load_dataset

test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

And I got the following error:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 589, in load_dataset
    module_path, hash = prepare_module(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 268, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
    output_path = get_from_cache(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py

How can I fix this?

lhoestq commented 3 years ago

Are you running the script on a machine with an internet connection?

AI678 commented 3 years ago

Yes, I can browse the URL through Google Chrome.

lhoestq commented 3 years ago

Does this HEAD request return 200 on your machine?

import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

If it returns 200, could you try to load the dataset again?

AI678 commented 3 years ago

Thank you very much for your response. When I run

import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

It returns 200.

Then I tried to load the dataset again and got the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 475, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\builder.py", line 531, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "C:\Users\666666\.cache\huggingface\modules\datasets_modules\datasets\cnn_dailymail\0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602\cnn_dailymail.py", line 253, in _split_generators
    dl_paths = dl_manager.download_and_extract(_DL_URLS)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 254, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\download_manager.py", line 175, in download
    downloaded_path_or_paths = map_nested(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 224, in map_nested
    mapped = [
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 225, in <listcomp>
    _single_map_nested((function, obj, types, None, True)) for obj in tqdm(iterable, disable=disable_tqdm)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\py_utils.py", line 163, in _single_map_nested
    return function(data_struct)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
    output_path = get_from_cache(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ

A connection error happened again, but this time the URL was different.

So I added the following check:

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

This didn't return 200. It returned the following:

Traceback (most recent call last):
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001F6060618E0>: Failed to establish a new connection: [WinError 10060]

lhoestq commented 3 years ago

Is Google Drive blocked on your network? For me,

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")

returns 200
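
If it also times out for you, a quick diagnostic sketch (not something datasets itself needs) is to give requests an explicit timeout, and to pass your proxy explicitly if you sit behind one; the proxy URL below is a placeholder, not a real proxy:

import requests

url = "https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ"

# Fail fast instead of hanging until WinError 10060 (connection timeout).
try:
    print(requests.head(url, timeout=10).status_code)
except requests.exceptions.RequestException as e:
    print("direct connection failed:", e)

# If your machine is behind a corporate proxy, point requests at it explicitly.
try:
    print(requests.head(url, timeout=10, proxies={"https": "http://proxy.example.com:8080"}).status_code)
except requests.exceptions.RequestException as e:
    print("proxied connection failed:", e)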

AI678 commented 3 years ago

I can browse Google Drive through Google Chrome. It's weird. I can also download the dataset manually from Google Drive.

lhoestq commented 3 years ago

Could you try to update requests, maybe? It works with 2.23.0 on my side.

AI678 commented 3 years ago

My requests version is 2.24.0. It still doesn't return 200.

AI678 commented 3 years ago

Is it possible to download the dataset manually from Google Drive and use it for further testing? How can I do this? I want to reproduce the model in this link: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16. But I can't download the dataset through the load_dataset method. I have tried many times and the connection error always happens.

lhoestq commented 3 years ago

The HEAD request should definitely work, not sure what's going on on your side. If you find a way to make it work, please post it here since other users might encounter the same issue.

If you don't manage to fix it, you can use load_dataset on Google Colab and then save it with dataset.save_to_disk("path/to/dataset"). Then you can download the directory to your machine and do:

from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")
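
For completeness, a minimal sketch of the Colab side, assuming the cnn_dailymail config discussed in this thread (the directory name is just an example):

# Run this on Google Colab, where the download works.
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset.save_to_disk("cnn_dailymail_saved")  # zip and download this directory afterwards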

smile0925 commented 3 years ago

Hi, I want to know if this problem has been solved, because I encountered a similar issue. Thanks.

train_data = datasets.load_dataset("xsum", split="train")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/xsum/xsum.py

lhoestq commented 3 years ago

Hi @smile0925 ! Do you have an internet connection? Are you using some kind of proxy that may block access to this file?

Otherwise you can try to update datasets, since we introduced retries for HTTP requests in version 1.2.0:

pip install --upgrade datasets
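
To confirm the upgrade took effect, a quick sanity check is to print the installed version:

import datasets
print(datasets.__version__)  # should be >= 1.2.0 to get the HTTP retries mentioned above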

Let me know if that helps.

smile0925 commented 3 years ago

Hi @lhoestq, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.

ZhengxiangShi commented 3 years ago

@smile0925 I have the same problem, have you solved it? Many thanks!

smile0925 commented 3 years ago

Hi @ZhengxiangShi, you can first check whether your network can access these files. I need to use a VPN to access them, so I download the files that cannot be reached in advance and then use them locally in the code, like this:

train_data = datasets.load_dataset("xsum.py", split="train")
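
For anyone replicating this, a rough sketch of the idea, assuming the files can be fetched some other way (VPN, browser download, etc.); the local file name and script URL are taken from the error messages above:

import requests
from datasets import load_dataset

# Fetch the dataset script over whatever connection works and save it locally.
script_url = "https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/xsum/xsum.py"
with open("xsum.py", "wb") as f:
    f.write(requests.get(script_url).content)

# Note: the script itself downloads data files from other URLs, so those may
# also need to be fetched in advance the same way.
train_data = load_dataset("xsum.py", split="train")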

mikechen66 commented 1 year ago

For Ubuntu 20.04, here is my feedback.

Google Drive is OK, but raw.githubusercontent.com has a big problem: the server answers with a Fastly certificate that does not match the hostname, so urllib3's certificate verification fails.

1. Google Drive

import requests

requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
<Response [200]>

2. raw.githubusercontent.com

import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")

........

    raise CertificateError(
urllib3.util.ssl_match_hostname.CertificateError: hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '*.a.ssl.fastly.net', '*.hosts.fastly.net', '*.global.ssl.fastly.net', '*.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
........
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '*.a.ssl.fastly.net', '*.hosts.fastly.net', '*.global.ssl.fastly.net', '*.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

........

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

.......

    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '*.a.ssl.fastly.net', '*.hosts.fastly.net', '*.global.ssl.fastly.net', '*.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))

3. XSUM

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (SSLError(MaxRetryError('HTTPSConnectionPool(host=\'raw.githubusercontent.com\', port=443): Max retries exceeded with url: /EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (Caused by SSLError(CertificateError("hostname \'raw.githubusercontent.com\' doesn\'t match either of \'default.ssl.fastly.net\', \'fastly.com\', \'*.a.ssl.fastly.net\', \'*.hosts.fastly.net\', \'*.global.ssl.fastly.net\', \'*.fastly.com\', \'a.ssl.fastly.net\', \'purge.fastly.net\', \'mirrors.fastly.net\', \'control.fastly.net\', \'tools.fastly.net\'")))')))

The following snippet did not fix the underlying SSL error:

import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
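
That is expected: requests and urllib3 build their own SSLContext from the certifi CA bundle, so patching ssl._create_default_https_context only affects the standard-library urllib. A diagnostic-only (and insecure) sketch of the equivalent for requests would be:

import requests

# Diagnostic only: if this succeeds while a verified request fails, the problem
# is purely certificate verification. Do not leave verify=False in real code.
r = requests.head(
    "https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py",
    verify=False,
)
print(r.status_code)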

lhoestq commented 1 year ago

Only the oldest versions of datasets use raw.githubusercontent.com. Can you try updating datasets?

mikechen66 commented 1 year ago

Thanks @lhoestq for the quick response.

I solved the issue from the command line as follows.

1. Open hosts (Ubuntu 20.04)

$ sudo gedit /etc/hosts

2. Add the following line to the hosts file

151.101.0.133 raw.githubusercontent.com

3. Save the hosts file

After that, the Jupyter notebook can access the datasets module and fetch the XSUM dataset from raw.githubusercontent.com.
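
To verify that the hosts entry is being picked up, a quick check from Python (the expected address is the one added above):

import socket

# After editing /etc/hosts, the hostname should resolve to the pinned address.
print(socket.gethostbyname("raw.githubusercontent.com"))  # expect 151.101.0.133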

So it is not the user's fault, but most of the suggestions on the web are wrong. Anyway, I finally solved the problem.

By the way, users may need to add entries for other GitHub hosts as well, such as the following (the exact addresses can change over time, so look them up if these stop working).

199.232.69.194 github.global.ssl.fastly.net

Cheers!!!

mikechen66 commented 1 year ago

I use datasets 2.14.4, which was published on Aug 8, 2023.