huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

DownloadConfig.proxies does not work when load_dataset_builder calls HfApi.dataset_info #6032

Open codingl2k1 opened 1 year ago

codingl2k1 commented 1 year ago

Describe the bug

download_config = DownloadConfig(proxies={'https': '<my proxy>'})
builder = load_dataset_builder(..., download_config=download_config)

But when getting the dataset_info from HfApi, the HTTP requests do not use the proxies.

Steps to reproduce the bug

  1. Setup proxies in DownloadConfig.
  2. Call load_dataset_builder with the download_config.
  3. Inspect the call stack in HfApi.dataset_info.

(screenshot of the call stack)

Expected behavior

DownloadConfig.proxies works for getting dataset_info.

Environment info

https://github.com/huggingface/datasets/commit/406b2212263c0d33f267e35b917f410ff6b3bc00
Python 3.11.4

mariosasko commented 1 year ago

HfApi comes from the huggingface_hub package. You can use this utility to change the huggingface_hub's Session proxies (see the example).

We plan to implement https://github.com/huggingface/datasets/issues/5080 and make this behavior more consistent eventually.
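
For reference, the mentioned utility can be used roughly as follows (a minimal sketch; the proxy address is a placeholder you would adapt to your environment, and it assumes huggingface_hub is installed):

```python
import requests
from huggingface_hub import configure_http_backend, get_session

def backend_factory() -> requests.Session:
    # Every Session created by huggingface_hub will carry these proxies.
    session = requests.Session()
    session.proxies = {
        "http": "http://127.0.0.1:8887",   # placeholder proxy address
        "https": "http://127.0.0.1:8887",  # placeholder proxy address
    }
    return session

# Register the factory as huggingface_hub's default session factory.
configure_http_backend(backend_factory=backend_factory)

# Subsequent Hub calls (e.g. HfApi.dataset_info) go through this session.
print(get_session().proxies["https"])
```

Note this only affects requests issued through huggingface_hub, not every download datasets performs.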

codingl2k1 commented 1 year ago

Thanks. I will try huggingface_hub.configure_http_backend to change session's config.

tarrade commented 1 year ago

@mariosasko are you saying if I do the following:

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = {
        "https": "127.0.0.1:8887",
        "http": "127.0.0.1:8887",
    }
    session.verify = "/etc/ssl/certs/ca-certificates.crt"
    return session

# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)

which works nicely with the transformers library:

def download_gpt_2_model():
    tokenizer = GPT2Tokenizer.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors="pt")
    print(encoded_input)

    model = GPT2Model.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    output = model(**encoded_input)

should work for the datasets library as well?

In my case if I just do:

def download_sts12_sts_dataset():
    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )

I am getting: ConnectionError: Couldn't reach https://huggingface.co/datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f429e87a3a0>, 'Connection to huggingface.co timed out. (connect timeout=100)'))")))

which is typical when the proxy server is not defined. It looks like what is set in configure_http_backend(backend_factory=backend_factory) is ignored.

If I use env variables instead, it works:

def download_sts12_sts_dataset():

    os.environ["https_proxy"] = "127.0.0.1:8887"
    os.environ["http_proxy"] = "127.0.0.1:8887"
    os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"

    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )

Should I add something?

I am using huggingface_hub 0.15.1, datasets 2.13.0, transformers 4.30.2

mariosasko commented 1 year ago

huggingface_hub.configure_http_backend works for transformers because they only use the huggingface_hub lib for downloads. Our download logic is a bit more complex (e.g., we also support downloading non-Hub files), so we are not aligned with them yet. In the meantime, it's best to use the env vars.
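
As a stopgap, the env-var route works because requests (which datasets uses under the hood) resolves proxies from the environment. A minimal standard-library sketch to check what would be picked up (the proxy address is a placeholder):

```python
import os
import urllib.request

# Set these before any download is triggered; requests honors the same
# variables when deciding which proxy to use for a given URL scheme.
os.environ["http_proxy"] = "http://127.0.0.1:8887"   # placeholder proxy
os.environ["https_proxy"] = "http://127.0.0.1:8887"  # placeholder proxy

# getproxies() mirrors the per-scheme environment lookup requests performs.
print(urllib.request.getproxies()["https"])
```

Setting the variables inside the Python process, as in the snippet above, is enough as long as it happens before the first request is made.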

tarrade commented 12 months ago

@mariosasko I fully understand that the logic for datasets is different. I see 2 issues with the current implementation of the env variables:

One of the best ways would be to be able to pass our requests.Session() directly, as the openai library allows:

import requests
import openai  # pre-1.0 openai API

session = requests.Session()
session.cert = CERT  # CERT: path to a client certificate
session.verify = False  # disables TLS verification (not recommended in production)
openai.requestssession = session  # openai < 1.0 accepted a custom Session this way

My 2 cents in this discussion