Open zwsjink opened 1 year ago
@djghosh13
Hi, thanks for bringing this up! I assumed that the HF datasets would work properly without an Internet connection, because the download_evalsets.py script loads them once and puts them in the cache already. I'll look into potential solutions to this issue.
Can you try setting the environment variable HF_DATASETS_OFFLINE to 1? (from https://huggingface.co/docs/datasets/v2.14.5/en/loading#offline)
It seems like even if the dataset is cached, HF will by default check the online version. So hopefully this should fix things.
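For reference, setting this in the shell before launching the evaluation would look something like the following (the second variable is an assumption that newer library versions also consult the hub):

```shell
# Force the datasets library to read only from the local cache,
# set before launching the evaluation script:
export HF_DATASETS_OFFLINE=1
# Newer versions also perform hub lookups; this flag covers those too:
export HF_HUB_OFFLINE=1
```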
If that doesn't work, could you check to make sure the files are indeed in the hf_cache folder?
Sorry for getting back to you late, but I was able to bypass this issue by modifying the datacomp source code as follows:
```diff
diff --git a/eval_utils/retr_eval.py b/eval_utils/retr_eval.py
index 3c19917..647edf7 100644
--- a/eval_utils/retr_eval.py
+++ b/eval_utils/retr_eval.py
@@ -37,7 +37,7 @@ def evaluate_retrieval_dataset(
     dataset = RetrievalDataset(
         datasets.load_dataset(
-            f"nlphuji/{task.replace('retrieval/', '')}",
+            f"/mnt/data/datacomp2023/evaluate_datasets/{task.replace('retrieval/', '')}.py",
             split="test",
             cache_dir=os.path.join(data_root, "hf_cache")
             if data_root is not None
```
This forces HF to use my local dataset repository instead of checking for any online updates.
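In effect the patch just maps each retrieval task name to a local dataset-script path instead of a hub repo id. A small sketch of that mapping (the base directory layout here is an example, not part of datacomp):

```python
import os

def local_dataset_script(task, data_root):
    """Map a retrieval task name like 'retrieval/flickr30k' to a local
    dataset-script path, mirroring the one-line patch above.
    The 'evaluate_datasets' layout is an assumed example."""
    name = task.replace("retrieval/", "")
    return os.path.join(data_root, "evaluate_datasets", f"{name}.py")

# e.g. local_dataset_script("retrieval/flickr30k", "/mnt/data/datacomp2023")
# yields "/mnt/data/datacomp2023/evaluate_datasets/flickr30k.py"
```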
Well, I first used download_evalsets.py to download all the necessary datasets on an internet-accessible machine and then migrated the data to my machine with limited internet access. All the other evaluations went well except the retrieval datasets, which use the hf_cache/ directory instead.
The error goes like this:
It seems like the huggingface datasets module is still trying to connect to the internet. Is there any trick I can play to skip the connection to Hugging Face? The evaluation command: