huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Connection error to the Hugging Face dataset Hub due to SSLError with proxy #5207

Open leemgs opened 2 years ago

leemgs commented 2 years ago

Describe the bug

It's weird. I cannot connect to the Hugging Face dataset Hub from my office due to an SSLError. Even when I connect through my company's proxy address (i.e., setting http_proxy and https_proxy), I still get the SSLError. What should I do to download datasets stored on Hugging Face normally? I welcome any comments; they would be helpful to me.

```
real    0m7.742s
user    0m4.930s
```


### Steps to reproduce the bug

Steps to reproduce this behavior:

```
(deepspeed) geunsik-lim@ai02:~/qtlab$ ./test_debian_csrc_dataset.py
Traceback (most recent call last):
  File "/data/home/geunsik-lim/qtlab/./test_debian_csrc_dataset.py", line 6, in <module>
    dataset = load_dataset("moyix/debian_csrc")
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1719, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1497, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1222, in dataset_module_factory
    raise e1 from None
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1179, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)
(deepspeed) geunsik-lim@ai02:~/qtlab$ cat ./test_debian_csrc_dataset.py
```

```python
#!/usr/bin/env python
from datasets import load_dataset

dataset = load_dataset("moyix/debian_csrc")
```


1.  Add my company's proxy address in `/etc/profile` (see the sketch after these steps).
2.  Download the dataset with the `load_dataset()` function of the `datasets` package provided by Hugging Face.
3.  In this case, the dataset path is "moyix/debian_csrc".
4.  I get the "`ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)`" error message.
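
For reference, the proxy settings in `/etc/profile` looked roughly like this (the address below is a placeholder, not the real company proxy):

```sh
# Placeholder corporate proxy address; replace with your site's real proxy.
export http_proxy="http://proxy.example.com:8080"
export https_proxy="http://proxy.example.com:8080"
```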

### Expected behavior

The dataset should download without errors. Instead, I get this error message:

```
ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)
```

### Environment info

* Software version information:

```
(deepspeed) geunsik-lim@ai02:~$ conda list -f pytorch
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name        Version   Build                          Channel
pytorch       1.13.0    py3.10_cuda11.7_cudnn8.5.0_0   pytorch

(deepspeed) geunsik-lim@ai02:~$ conda list -f python
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name        Version   Build        Channel
python        3.10.6    haa1d7c7_1

(deepspeed) geunsik-lim@ai02:~$ conda list -f datasets
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name        Version   Build   Channel
datasets      2.6.1     py_0    huggingface

(deepspeed) geunsik-lim@ai02:~$ uname -a
Linux ai02 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

(deepspeed) geunsik-lim@ai02:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
```

lhoestq commented 2 years ago

Hi! It looks like an issue with your Python environment. Can you make sure you're able to run GET requests to https://huggingface.co using `requests` in Python?
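
For example, a minimal check might look like this (the proxy address below is a placeholder for your company's proxy):

```python
import requests

# Placeholder proxy settings; substitute your company's proxy address.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

response = requests.get("https://huggingface.co", proxies=proxies, timeout=10)
print(response.status_code)  # 200 means the Hub is reachable through the proxy
```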

leemgs commented 2 years ago

Thanks for your reply. Does this mean that I have to use the `load_dataset` function together with the `requests` library to download the dataset from within the company's proxy environment?

Or just clone the dataset repo:

```
git lfs install
git clone https://huggingface.co/datasets/moyix/debian_csrc
```

If you want to clone without large files (just their pointers), prepend your `git clone` with the following env var:

```
GIT_LFS_SKIP_SMUDGE=1
```

lhoestq commented 2 years ago

You can use `requests` to see if downloading a file from the Hugging Face Hub works. If so, then `datasets` should work as well. If not, then you'll have to find another way, using an internet connection that works.
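
For instance, a quick sanity check could fetch a raw file from the dataset repo over the Hub's `resolve` endpoint (assuming the repo has a `README.md` at its root):

```python
import requests

# Resolve URL for a file in the dataset repo; assumes README.md exists at the root.
url = "https://huggingface.co/datasets/moyix/debian_csrc/resolve/main/README.md"
response = requests.get(url, timeout=10)
response.raise_for_status()  # raises if the download failed (SSL, proxy, 4xx/5xx)
print(response.text[:200])
```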

leemgs commented 1 year ago

I resolved this issue by applying to have https://huggingface.com unblocked ("unblock websites") in our corporate network environment, which sits behind a firewall.

lonngxiang commented 1 year ago

> Hi! It looks like an issue with your Python environment. Can you make sure you're able to run GET requests to https://huggingface.co using `requests` in Python?

Yes, but it still does not work.


kuikuikuizzZ commented 12 months ago

I read https://github.com/huggingface/datasets/blob/main/src/datasets/load.py; it fails when fetching the dataset metadata, so `download_config` is never applied there:

```python
            hf_api = HfApi(config.HF_ENDPOINT)
            try:
                dataset_info = hf_api.dataset_info(
                    repo_id=path,
                    revision=revision,
                    token=download_config.token,
                    timeout=100.0,
                )
            except Exception as e:  # noqa catch any exception of hf_hub and consider that the dataset doesn't exist
                if isinstance(
                    e,
                    (
                        OfflineModeIsEnabled,
                        requests.exceptions.ConnectTimeout,
                        requests.exceptions.ConnectionError,
                    ),
                ):
                    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
```

I configured the `huggingface_hub` API (which `datasets` uses under the hood to fetch Hub metadata) with `configure_http_backend`:

```python
import requests
from huggingface_hub import configure_http_backend

# The original snippet assumed a `proxy` dict was already defined;
# a placeholder mapping is shown here.
proxy = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = proxy
    session.verify = False  # skip TLS verification; only safe on a trusted network
    return session

configure_http_backend(backend_factory=backend_factory)
```

It works.

DataScientistTX commented 10 months ago

Even though it does not look like a certificate error from the error message, I had the same error, and adding the following lines to my code solved my problem:

```python
import os

# An empty CA bundle makes `requests` skip SSL certificate verification.
os.environ['CURL_CA_BUNDLE'] = ''
```

NoviceStone commented 9 months ago

@kuikuikuizzZ Could you please explain where the configuration code should be added?

mahdibaghbanzadeh commented 8 months ago

> Even though it does not look like a certificate error from the error message, I had the same error, and adding `import os; os.environ['CURL_CA_BUNDLE'] = ''` to my code solved my problem.

Worked for me as well! I faced the issue while submitting jobs through SLURM.

Joeland4 commented 6 months ago

> Even though it does not look like a certificate error from the error message, I had the same error, and adding `import os; os.environ['CURL_CA_BUNDLE'] = ''` to my code solved my problem.

This doesn't work for me. What does this code mean?

marcv12 commented 4 months ago

If you're working on a cluster, it may be that remote connections are disabled for security purposes. In that case you will have to download the files on your local machine and then transfer them to your cluster through scp or some other transfer protocol; a sketch of this workflow follows below. I know the original poster has probably resolved the issue, but this is for anyone who stumbles across this thread in the future and needs help, because I struggled with this even after reading the thread.
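
A minimal sketch of that workflow with the `datasets` API (the paths and cluster hostname are placeholders):

```python
from datasets import load_dataset, load_from_disk

# On a machine with internet access: download and serialize the dataset.
dataset = load_dataset("moyix/debian_csrc")
dataset.save_to_disk("./debian_csrc")

# Transfer the directory to the cluster, e.g.:
#   scp -r ./debian_csrc user@cluster:/path/to/debian_csrc

# On the cluster: load the dataset without any network access.
dataset = load_from_disk("/path/to/debian_csrc")
```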

shafferjohn commented 4 months ago

> Even though it does not look like a certificate error from the error message, I had the same error, and adding `import os; os.environ['CURL_CA_BUNDLE'] = ''` to my code solved my problem.

If this does not work, try setting the proxy for both the shell and git:

```
export http_proxy="http://127.0.0.1:10810"
export https_proxy="http://127.0.0.1:10810"
git config --global http.proxy http://127.0.0.1:10810
git config --global https.proxy http://127.0.0.1:10810
```

Set these proxy environment variables first, then start the notebook in the same session:

```
jupyter notebook
```

Joeland4 commented 4 months ago

> If you're working on a cluster, it may be that remote connections are disabled for security purposes. In that case you will have to download the files on your local machine and then transfer them to your cluster through scp or some other transfer protocol. [...]

Thank you, buddy!