Closed: torotoki closed this 1 month ago
Hi! Yes, datasets now uses huggingface_hub to download and cache files from the HF Hub, so you need to use HF_HOME (or set HF_HUB_CACHE and HF_DATASETS_CACHE manually if you want to keep HF Hub cached files and cached datasets Arrow files separate). So in your change, I guess it needs to be HF_HOME instead of HF_CACHE?
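For reference, a minimal sketch of the two options described above (the /mnt/data paths are illustrative, not part of the original comment):

import os

# Option 1: relocate everything (Hub downloads and datasets Arrow cache)
# under one root; sub-caches default to $HF_HOME/hub and $HF_HOME/datasets.
os.environ['HF_HOME'] = '/mnt/data/hf_home'

# Option 2: keep Hub files and datasets Arrow files in separate places.
# os.environ['HF_HUB_CACHE'] = '/mnt/data/hub_cache'
# os.environ['HF_DATASETS_CACHE'] = '/mnt/data/datasets_cache'

# Either way, set these before importing datasets; the cache paths are
# resolved at import time (see the comment below).
from datasets import load_dataset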
Thank you for your comment. You are right. I'm sorry for my mistake; I've fixed it.
I just had this issue, and needed to move the env-setting code to the top of the Python file, before importing the library, i.e.:

import os

# Point the HF caches at a local disk mount; this must happen before
# importing datasets, which reads these variables at import time.
LOCAL_DISK_MOUNT = '/mnt/data'
os.environ['HF_HOME'] = f'{LOCAL_DISK_MOUNT}/hf_cache/'
os.environ['HF_DATASETS_CACHE'] = f'{LOCAL_DISK_MOUNT}/datasets/'

from datasets import load_dataset
from datasets import load_dataset_builder
from psutil._common import bytes2human
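As a quick check that the relocation took effect, something like the following should work (hypothetical follow-up, not from the original comment; 'imdb' is just an example dataset):

builder = load_dataset_builder('imdb')
print(builder.cache_dir)                        # should resolve under /mnt/data
print(bytes2human(builder.info.download_size))  # human-readable download size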
Resolves #6256. As far as I tested, HF_DATASETS_CACHE was ignored and I could not specify the cache directory via this environment variable at all; only the default one was used. HF_HOME has worked. Perhaps the recent change to file downloading via huggingface_hub could have caused this bug.

In my testing, I could not specify the cache directory even with load_dataset("dataset_name", cache_dir="..."). That might be a separate issue. I also welcome any advice on how to solve it.
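A minimal repro sketch of the reported behavior, under my reading of the report above (dataset name and path are illustrative stand-ins):

import os
os.environ['HF_DATASETS_CACHE'] = '/tmp/datasets_cache'

from datasets import load_dataset

# Reported behavior: downloaded files still land under the default
# ~/.cache/huggingface rather than /tmp/datasets_cache, since the download
# step now goes through huggingface_hub, which honors HF_HOME / HF_HUB_CACHE.
ds = load_dataset('rotten_tomatoes')  # example stand-in for the actual dataset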