huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset loads indefinitely after modifying default cache path (~/.cache/huggingface) #3986

Open kelvinAI opened 2 years ago

kelvinAI commented 2 years ago

Describe the bug

Dataset loads indefinitely after modifying the cache path (~/.cache/huggingface). If none of the environment variables are set, this custom dataset (a JSON-based dataset with a custom dataset loading script) loads fine. Update: Transformers modules face the same issue during loading.


Issue:

custom_cache_dir  
  | -- modules  
            | -- __init__.py  
            | -- datasets_modules  
                      | -- __init__.py  
                      | -- datasets  
                             | -- __init__.py  
                             | -- script.py (Dataset loading script)  
                             | -- script.lock   

There's no error and no logs are emitted, so I'm out of ideas on how to debug this. The custom dataset works fine if the default ~/.cache dir is used, but unfortunately it's out of space and we do not have permission to modify the disk.
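For context, I redirect the cache roughly as follows; the paths below are placeholders, not the exact values used:

```python
import os

# Redirect the whole Hugging Face cache away from ~/.cache/huggingface.
# This must be set before importing datasets/transformers.
os.environ["HF_HOME"] = "/path/to/custom_cache_dir"
# Alternatively, move only the datasets cache:
# os.environ["HF_DATASETS_CACHE"] = "/path/to/custom_cache_dir/datasets"

from datasets import load_dataset

# The cache directory can also be overridden per call:
ds = load_dataset("path/to/script.py", cache_dir="/path/to/custom_cache_dir")
```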

Steps to reproduce the bug

What I've tried:

Expected results

Datasets should load and cache as usual, with the only exception that the cache directory is different.

Actual results

Any of the actions taken above to change the cache directory results in the dataset loading indefinitely, without terminating.

Environment info

lhoestq commented 2 years ago

Hi! I wasn't able to reproduce the issue. When you kill the process, is there a stack trace that shows at what point in the code Python is hanging?
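One way to capture such a traceback without killing the process is Python's built-in faulthandler registered on a signal; a minimal sketch (the script path is a placeholder):

```python
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process print the current
# Python stack of every thread to stderr without terminating it.
faulthandler.register(signal.SIGUSR1)

from datasets import load_dataset

ds = load_dataset("path/to/script.py")  # the call that hangs
```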

kelvinAI commented 2 years ago

Hi @lhoestq, I've traced the issue back to file locking. It's similar to this thread, which also involves a Lustre filesystem: https://github.com/huggingface/datasets/issues/329. In that case the user was able to add the -o flock option while mounting, which solved the problem.
However, in other cases such as mine, we do not have permission to change the mount options. I'm still trying to figure out a workaround. Any ideas on how we can use a mounted Lustre filesystem without the flock option?
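A quick way to check whether flock actually works on the mount is something like the following (the test path is a placeholder on the Lustre filesystem):

```python
import fcntl

# Try to take an exclusive, non-blocking flock on a file on the mount;
# an OSError here usually means the filesystem does not support flock.
test_path = "/path/on/lustre_fs/.flock_test"  # placeholder
with open(test_path, "w") as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("flock acquired: file locking works on this mount")
        fcntl.flock(f, fcntl.LOCK_UN)
    except OSError as e:
        print(f"flock failed: {e}")
```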

jpmcd commented 2 years ago

Hi @kelvinAI, I've had this issue on our institution's system, which uses Lustre (in addition to our compute nodes being siloed off from external network access). The workaround I found for downloading and loading datasets was to set the $HF_HOME environment variable to a location on the node's local storage (SSD), i.e. a location that gets cleared regularly and is commonly used for temporary or cached files, such as "scratch" storage. Maybe your sysadmins, if you have them, could point you to subdirectories on a node that aren't linked to the Lustre filesystem. After downloading to scratch I found that the cached transformers, modules, and metrics folders were fine to move to my user drives on the Lustre filesystem, but cached datasets that had fingerprints still had some issues with filelock, so it helps to use my_dataset.save_to_disk('path/on/lustre_fs') and the static class method Dataset.load_from_disk('path/on/lustre_fs'). In rough steps:

  1. Initially download to scratch storage with ds = datasets.load_dataset(dataset_name)
  2. Call ds.save_to_disk(my_path_on_lustre) with a path in your user space on the Lustre filesystem
  3. Load datasets with from datasets import Dataset; new_ds = Dataset.load_from_disk(my_path_on_lustre)

Obviously this hinges on there existing scratch storage on the nodes you're using. Fingers crossed.
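In code, those three steps look roughly like this; the dataset name and paths are placeholders, and the module-level load_from_disk is used so it also works when load_dataset returns a DatasetDict:

```python
from datasets import load_dataset, load_from_disk

# 1. Download and prepare into node-local scratch storage (not on Lustre).
ds = load_dataset("dataset_name", cache_dir="/scratch/hf_cache")

# 2. Persist the prepared Arrow files onto the Lustre filesystem.
ds.save_to_disk("/path/on/lustre_fs/my_dataset")

# 3. Later, reload directly from Lustre without going through the cache directory.
new_ds = load_from_disk("/path/on/lustre_fs/my_dataset")
```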

kelvinAI commented 2 years ago

Hi @jpmcd, thanks for sharing your experience. In my case, the Lustre filesystem (with more storage space) is the scratch storage like the one you mentioned. We have local storage for each user, but unfortunately there's not enough space on it to cache huge datasets, which is why I tried changing HF_HOME to point to the scratch disk with more space and ran into the flock issue. Unfortunately I'm not aware of any viable solution for now, so I simply fall back to using a torch dataset.
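That fallback can be as simple as a plain PyTorch dataset over the JSON files; a minimal sketch, assuming one JSON object per line and placeholder paths:

```python
import json
from torch.utils.data import Dataset, DataLoader

class JsonLinesDataset(Dataset):
    """Minimal stand-in for the HF dataset: one JSON object per line."""

    def __init__(self, path):
        with open(path) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]

# Placeholder path; no Hugging Face cache (and thus no flock) is involved.
ds = JsonLinesDataset("/path/on/lustre_fs/data.jsonl")
loader = DataLoader(ds, batch_size=8)
```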

minaremeli commented 1 year ago

@jpmcd your comment saved me from pulling my hair out in frustration. Setting HF_HOME to a directory that's not on Lustre works like a charm. ✨