codingl2k1 opened this issue 1 year ago
`HfApi` comes from the `huggingface_hub` package. You can use the `configure_http_backend` utility to change `huggingface_hub`'s `Session` proxies (see the example).
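Once a factory is registered via `configure_http_backend`, `huggingface_hub.get_session()` can be used to confirm the settings took effect. A minimal check, assuming a factory like the one in the next comment has already been registered:

```python
from huggingface_hub import get_session

# After configure_http_backend(...) has been called, every download made by
# huggingface_hub goes through this session, so its proxies should match the
# ones set in the factory.
print(get_session().proxies)
```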
We plan to implement https://github.com/huggingface/datasets/issues/5080 and make this behavior more consistent eventually.
Thanks. I will try `huggingface_hub.configure_http_backend` to change the session's config.
@mariosasko are you saying that if I do the following:
```python
import requests
from huggingface_hub import configure_http_backend

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = {
        "https": "127.0.0.1:8887",
        "http": "127.0.0.1:8887",
    }
    session.verify = "/etc/ssl/certs/ca-certificates.crt"
    return session

# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)
```
which works nicely with the transformers library:
```python
from transformers import GPT2Model, GPT2Tokenizer

def download_gpt_2_model():
    tokenizer = GPT2Tokenizer.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors="pt")
    print(encoded_input)
    model = GPT2Model.from_pretrained(
        "gpt2", force_download=True, resume_download=False
    )
    output = model(**encoded_input)
```
should work for the datasets library as well?
In my case, if I just do:
```python
from datasets import load_dataset

def download_sts12_sts_dataset():
    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )
```
I am getting:

```
ConnectionError: Couldn't reach https://huggingface.co/datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /datasets/mteb/sts12-sts/resolve/main/dataset_infos.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f429e87a3a0>, 'Connection to huggingface.co timed out. (connect timeout=100)'))")))
```
which is typical when the proxy server is not defined. It looks like whatever is set via `configure_http_backend(backend_factory=backend_factory)` is ignored.
If I use env variables instead, it works:
```python
import os

from datasets import load_dataset

def download_sts12_sts_dataset():
    os.environ["https_proxy"] = "127.0.0.1:8887"
    os.environ["http_proxy"] = "127.0.0.1:8887"
    os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"
    dataset = load_dataset(
        "mteb/sts12-sts",
        download_mode="force_redownload",
        verification_mode="basic_checks",
        revision="main",
    )
```
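This works because `requests` reads proxy and CA-bundle settings from the environment at request time whenever a session's `trust_env` is left at its default of `True`, independently of how the session was created. A small illustration using `requests`' `merge_environment_settings` (the proxy address is the placeholder from above):

```python
import os
import requests

os.environ["https_proxy"] = "127.0.0.1:8887"

session = requests.Session()  # trust_env=True by default
# merge_environment_settings computes the settings that will actually be
# applied to a request for this URL, including env-derived proxies.
settings = session.merge_environment_settings(
    url="https://huggingface.co", proxies={}, stream=None, verify=None, cert=None
)
print(settings["proxies"])  # expected to include {'https': '127.0.0.1:8887'}
```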
Should I add something? I am using huggingface_hub 0.15.1, datasets 2.13.0, transformers 4.30.2.
`huggingface_hub.configure_http_backend` works for `transformers` because they only use the `huggingface_hub` lib for downloads. Our download logic is a bit more complex (e.g., we also support downloading non-Hub files), so we are not aligned with them yet. In the meantime, it's best to use the env vars.
@mariosasko I fully understand that the logic for datasets is different. I see 2 issues with the current implementation of the env variables:
One of the best ways would be to be able to pass our `requests.Session()` directly, as the `openai` library allows:
```python
import openai
import requests

session = requests.Session()
session.cert = CERT  # CERT: path to a client certificate (placeholder)
session.verify = False
# openai < 1.0 lets callers override the session via this module attribute
openai.requestssession = session
```
My 2 cents in this discussion.
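Letting the caller hand over a ready-made `requests.Session` would keep proxies, client certificates, TLS verification and retry policy in one object, which is essentially what `configure_http_backend` already offers on the `huggingface_hub` side; the gap is that `datasets`' non-Hub download paths don't go through that session yet.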
### Describe the bug

I set the proxies in `DownloadConfig`, but when getting the dataset_info from `HfApi`, the HTTP requests do not use the proxies.
### Steps to reproduce the bug

Call `load_dataset_builder` with a `download_config`.

### Expected behavior
`DownloadConfig.proxies` works for getting the dataset_info.
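For reference, a minimal sketch of the expected usage (dataset name and proxy address taken from the comments above):

```python
from datasets import DownloadConfig, load_dataset

# The expectation: proxies set here should be honored by every request,
# including the dataset_info lookup done through HfApi.
download_config = DownloadConfig(
    proxies={"https": "127.0.0.1:8887", "http": "127.0.0.1:8887"}
)
dataset = load_dataset("mteb/sts12-sts", download_config=download_config)
```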
### Environment info

- `datasets` at commit https://github.com/huggingface/datasets/commit/406b2212263c0d33f267e35b917f410ff6b3bc00
- Python 3.11.4