huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

DownloadConfig.proxies does not work when load_dataset_builder calls HfApi.dataset_info #6032

Open codingl2k1 opened 1 year ago

codingl2k1 commented 1 year ago

Describe the bug

download_config = DownloadConfig(proxies={'https': '<my proxy>'})
builder = load_dataset_builder(..., download_config=download_config)

But when getting the dataset_info from HfApi, the HTTP requests do not use the proxies.

Steps to reproduce the bug

  1. Setup proxies in DownloadConfig.
  2. Call load_dataset_builder with the download_config.
  3. Inspect the call stack in HfApi.dataset_info.

(screenshot of the call stack)

Expected behavior

DownloadConfig.proxies works for getting dataset_info.

Environment info

https://github.com/huggingface/datasets/commit/406b2212263c0d33f267e35b917f410ff6b3bc00
Python 3.11.4

mariosasko commented 1 year ago

HfApi comes from the huggingface_hub package. You can use this utility to change the huggingface_hub's Session proxies (see the example).

We plan to implement https://github.com/huggingface/datasets/issues/5080 and make this behavior more consistent eventually.
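
For reference, the mentioned utility can be used roughly as follows (a minimal sketch; the proxy address is a placeholder you would adapt to your environment, and it assumes huggingface_hub is installed):

```python
import requests
from huggingface_hub import configure_http_backend, get_session

def backend_factory() -> requests.Session:
    # Every Session created by huggingface_hub will carry these proxies.
    session = requests.Session()
    session.proxies = {
        "http": "http://127.0.0.1:8887",   # placeholder proxy address
        "https": "http://127.0.0.1:8887",  # placeholder proxy address
    }
    return session

# Register the factory as huggingface_hub's default session factory.
configure_http_backend(backend_factory=backend_factory)

# Subsequent Hub calls (e.g. HfApi.dataset_info) go through this session.
print(get_session().proxies["https"])
```

Note this only affects requests issued through huggingface_hub, not every download datasets performs.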

codingl2k1 commented 1 year ago

Thanks. I will try huggingface_hub.configure_http_backend to change session's config.

tarrade commented 1 year ago

@mariosasko are you saying if I do the following:

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = {
        "https": "127.0.0.1:8887",
        "http": "127.0.0.1:8887",
    }
    session.verify = "/etc/ssl/certs/ca-certificates.crt"
    return session

# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)

which works nicely with the transformers library:

def download_gpt_2_model():
    tokenizer = GPT2Tokenizer.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors="pt")
    print(encoded_input)

    model = GPT2Model.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    output = model(**encoded_input)

should work for the datasets library as well?

In my case if I just do:

def download_sts12_sts_dataset():
    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )

I am getting: ConnectionError: Couldn't reach https://huggingface.co/datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f429e87a3a0>, 'Connection to huggingface.co timed out. (connect timeout=100)'))")))

which is typical when the proxy server is not defined. It looks like what is set in configure_http_backend(backend_factory=backend_factory) is ignored.

If I use env variables instead, it works:

def download_sts12_sts_dataset():

    os.environ["https_proxy"] = "127.0.0.1:8887"
    os.environ["http_proxy"] = "127.0.0.1:8887"
    os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"

    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )

Should I add something?

I am using huggingface_hub 0.15.1, datasets 2.13.0, transformers 4.30.2

mariosasko commented 1 year ago

huggingface_hub.configure_http_backend works for transformers because they only use the huggingface_hub lib for downloads. Our download logic is a bit more complex (e.g., we also support downloading non-Hub files), so we are not aligned with them yet. In the meantime, it's best to use the env vars.
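
As a stopgap, the env-var route works because requests (which datasets uses under the hood) resolves proxies from the environment. A minimal standard-library sketch to check what would be picked up (the proxy address is a placeholder):

```python
import os
import urllib.request

# Set these before any download is triggered; requests honors the same
# variables when deciding which proxy to use for a given URL scheme.
os.environ["http_proxy"] = "http://127.0.0.1:8887"   # placeholder proxy
os.environ["https_proxy"] = "http://127.0.0.1:8887"  # placeholder proxy

# getproxies() mirrors the per-scheme environment lookup requests performs.
print(urllib.request.getproxies()["https"])
```

Setting the variables inside the Python process, as in the snippet above, is enough as long as it happens before the first request is made.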

tarrade commented 12 months ago

@mariosasko I fully understand that the logic for datasets is different. I see 2 issues with the current implementation of the env variables:

One of the best ways would be to be able to pass our requests.Session() directly, as the openai library allows:

import requests
import openai  # pre-1.0 openai API

session = requests.Session()
session.cert = CERT  # CERT: path to a client certificate
session.verify = False  # disables TLS verification (not recommended in production)
openai.requestssession = session  # openai < 1.0 accepted a custom Session this way

My 2 cents in this discussion