huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Fix the environment variable for huggingface cache #7200

Closed torotoki closed 1 month ago

torotoki commented 1 month ago

Resolves #6256. As far as I tested, HF_DATASETS_CACHE was ignored: I could not change the cache directory from the default via this environment variable, while HF_HOME worked. Perhaps the recent change to file downloading via huggingface_hub introduced this bug.

In my testing, I could not specify the cache directory even with load_dataset("dataset_name", cache_dir="..."). That might be a separate issue. I also welcome any advice on solving it.

lhoestq commented 1 month ago

Hi! Yes, datasets now uses huggingface_hub to download and cache files from the HF Hub, so you need to use HF_HOME (or set HF_HUB_CACHE and HF_DATASETS_CACHE separately if you want to keep HF Hub cached files apart from cached datasets Arrow files).
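If it helps, the precedence between these variables can be sketched roughly as follows. This is a stdlib-only illustration, not the library's own code: resolve_datasets_cache is a hypothetical helper, and it assumes the documented defaults (~/.cache/huggingface for HF_HOME, and $HF_HOME/datasets for HF_DATASETS_CACHE).

```python
import os
from pathlib import Path

def resolve_datasets_cache(env):
    """Roughly mimic where cached dataset Arrow files end up.

    Precedence: HF_DATASETS_CACHE if set, else $HF_HOME/datasets,
    else the default ~/.cache/huggingface/datasets.
    """
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"])
    hf_home = env.get("HF_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache", "huggingface"
    )
    return Path(hf_home) / "datasets"

# With only HF_HOME set, datasets files land under $HF_HOME/datasets.
print(resolve_datasets_cache({"HF_HOME": "/mnt/data/hf_cache"}))
```

The same pattern applies to HF_HUB_CACHE, which defaults to $HF_HOME/hub for the raw files downloaded from the Hub.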

So in your change I guess it needs to be HF_HOME instead of HF_CACHE?

torotoki commented 1 month ago

Thank you for your comment. You are right; sorry for my mistake. I have fixed it.

HuggingFaceDocBuilderDev commented 1 month ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yukiman76 commented 3 weeks ago

I just had this issue, and needed to move the env-setting code to the top of the Python file, before importing the library, i.e.:

import os

LOCAL_DISK_MOUNT = '/mnt/data'

# These must be set before importing datasets; otherwise the default
# cache paths are already resolved at import time.
os.environ['HF_HOME'] = f'{LOCAL_DISK_MOUNT}/hf_cache/'
os.environ['HF_DATASETS_CACHE'] = f'{LOCAL_DISK_MOUNT}/datasets/'

from datasets import load_dataset, load_dataset_builder
from psutil._common import bytes2human
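An alternative that avoids reordering imports altogether is to set the variables in the shell before Python starts, so they are already in the environment at import time (a sketch; the paths are examples and train.py is a placeholder for your own script):

```shell
export HF_HOME=/mnt/data/hf_cache/
export HF_DATASETS_CACHE=/mnt/data/datasets/
# python train.py   # train.py stands in for whatever script you run
```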