Closed AI678 closed 3 years ago
Are you running the script on a machine with an internet connection?
Yes, I can browse the URL in Google Chrome.
Does this HEAD request return 200 on your machine?
import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")
If it returns 200, could you try again to load the dataset?
Thank you very much for your response. When I run
import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")
It returns 200.
I tried to load the dataset again and got the following error.
Traceback (most recent call last):
File "
A connection error happened, but the URL was different.
I added the following code:
requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
This didn't return 200. It returned the following:
Traceback (most recent call last):
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001F6060618E0>: Failed to establish a new connection: [WinError 10060]
Is Google Drive blocked on your network? For me,
requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
returns 200
I can browse Google Drive in Google Chrome. It's weird. I can download the dataset from Google Drive manually.
Could you maybe try updating requests? It works with 2.23.0 on my side.
My requests is 2.24.0. It still doesn't return 200.
Is it possible to download the dataset manually from Google Drive and use it for further testing? How can I do this? I want to reproduce the model in this link https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16, but I can't download the dataset through the load_dataset method. I have tried many times and the connection error always happens.
The head request should definitely work, not sure what's going on on your side. If you find a way to make it work, please post it here since other users might encounter the same issue.
If you don't manage to fix it, you can use load_dataset on Google Colab and then save it using dataset.save_to_disk("path/to/dataset").
Then you can download the directory to your machine and do:
from datasets import load_from_disk
dataset = load_from_disk("path/to/local/dataset")
Hi, I want to know whether this problem has been solved, because I encountered a similar issue. Thanks.
train_data = datasets.load_dataset("xsum", split="train")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.3/datasets/xsum/xsum.py
Hi @smile0925! Do you have an internet connection? Are you using some kind of proxy that may block access to this file?
Otherwise you can try updating datasets, since we introduced retries for HTTP requests in version 1.2.0:
pip install --upgrade datasets
Let me know if that helps.
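For reference, retries of this kind can be configured on a plain requests session; the retry counts and status codes below are illustrative, not the exact values datasets uses internally:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures up to 3 times with exponential backoff,
# including responses with the listed server-error status codes.
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
# session.head(url) / session.get(url) will now retry automatically.
```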
Hi @lhoestq, maybe you are right. I found that my server uses some kind of proxy that blocks access to this file.
I have the same problem. Have you solved it? Many thanks.
Hi @ZhengxiangShi
You can first check whether your network can access these files. I need a VPN to access them, so I download the inaccessible files locally in advance and then use them in the code, like this:
train_data = datasets.load_dataset("xsum.py", split="train")
On Ubuntu 20.04, I get the following results.
Google Drive is OK, but raw.githubusercontent.com has a big problem: the certificate the server presents doesn't match the hostname, so urllib3's certificate verification fails.
1. Google Drive
import requests
requests.head("https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ")
<Response [200]>
2. raw.githubusercontent.com
import requests
requests.head("https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py")
........
    raise CertificateError(
urllib3.util.ssl_match_hostname.CertificateError: hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
........
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))
........
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
.......
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py (Caused by SSLError(CertificateError("hostname 'raw.githubusercontent.com' doesn't match either of 'default.ssl.fastly.net', 'fastly.com', '.a.ssl.fastly.net', '.hosts.fastly.net', '.global.ssl.fastly.net', '.fastly.com', 'a.ssl.fastly.net', 'purge.fastly.net', 'mirrors.fastly.net', 'control.fastly.net', 'tools.fastly.net'")))
3. XSUM
from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (SSLError(MaxRetryError('HTTPSConnectionPool(host=\'raw.githubusercontent.com\', port=443): Max retries exceeded with url: /EdinburghNLP/XSum/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json (Caused by SSLError(CertificateError("hostname \'raw.githubusercontent.com\' doesn\'t match either of \'default.ssl.fastly.net\', \'fastly.com\', \'.a.ssl.fastly.net\', \'.hosts.fastly.net\', \'.global.ssl.fastly.net\', \'.fastly.com\', \'a.ssl.fastly.net\', \'purge.fastly.net\', \'mirrors.fastly.net\', \'control.fastly.net\', \'tools.fastly.net\'")))')))
# Workaround attempt: globally disable HTTPS certificate verification.
# This silences the CertificateError above, but it is insecure: it removes
# protection against exactly the kind of interception the error reports.
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
Only the oldest versions of datasets use raw.githubusercontent.com. Can you try updating datasets?
Thanks @lhoestq for the quick response.
I solved the big issue from the command line as follows.
1. Open hosts (Ubuntu 20.04)
$ sudo gedit /etc/hosts
2. Add the following line to the hosts file
151.101.0.133 raw.githubusercontent.com
3. Save the hosts file
Then the Jupyter notebook can access the datasets module and fetch the XSUM dataset from raw.githubusercontent.com.
So it is not the users' fault, but most of the suggestions on the web are wrong. Anyway, I finally solved the problem.
By the way, users may also need to add entries for other GitHub hosts, such as the following.
199.232.69.194 github.global.ssl.fastly.net
Cheers!!!
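To sanity-check a hosts entry without touching the real /etc/hosts, one can parse hosts-formatted text directly. This is a small sketch (the hosts_entry_for helper is made up for illustration, and fastly IPs such as 151.101.0.133 change over time, so verify the current one yourself):

```python
def hosts_entry_for(hosts_text, hostname):
    """Return the IP mapped to hostname in hosts-file-formatted text, or None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        # A hosts line is "IP name [alias...]"; match any of the names.
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0]
    return None

hosts = "127.0.0.1 localhost\n151.101.0.133 raw.githubusercontent.com\n"
ip = hosts_entry_for(hosts, "raw.githubusercontent.com")  # "151.101.0.133"
```

To check the live file, pass open("/etc/hosts").read() as hosts_text.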
I use datasets 2.14.4, which was published on Aug 8, 2023.
Hey, I want to load the cnn_dailymail dataset for fine-tuning. I wrote the following code:
from datasets import load_dataset
test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
And I got the following error:
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    test_dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 589, in load_dataset
    module_path, hash = prepare_module(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\load.py", line 268, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 300, in cached_path
    output_path = get_from_cache(
  File "C:\Users\666666\AppData\Local\Programs\Python\Python38\lib\site-packages\datasets\utils\file_utils.py", line 475, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.1.2/datasets/cnn_dailymail/cnn_dailymail.py
How can I fix this?