huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset loads indefinitely after modifying default cache path (~/.cache/huggingface) #3986

Open kelvinAI opened 2 years ago

kelvinAI commented 2 years ago

Describe the bug

Dataset loads indefinitely after modifying the cache path (~/.cache/huggingface). If none of the environment variables are set, this custom dataset (a JSON-based dataset with a custom dataset loading script) loads fine. Update: Transformers modules face the same issue during loading.


Issue:

custom_cache_dir  
  | -- modules  
            | -- __init__.py  
            | -- datasets_modules  
                      | -- __init__.py  
                      | -- datasets  
                             | -- __init__.py  
                             | -- script.py (Dataset loading script)  
                             | -- script.lock   

There's no error and no logs are emitted, so I'm out of ideas on how to debug this. The custom dataset works fine if the default ~/.cache dir is used, but unfortunately it's out of space and we do not have permission to modify the disk.
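For context, I redirect the cache roughly as follows; the paths below are placeholders, not the exact values used:

```python
import os

# Redirect the whole Hugging Face cache away from ~/.cache/huggingface.
# This must be set before importing datasets/transformers.
os.environ["HF_HOME"] = "/path/to/custom_cache_dir"
# Alternatively, move only the datasets cache:
# os.environ["HF_DATASETS_CACHE"] = "/path/to/custom_cache_dir/datasets"

from datasets import load_dataset

# The cache directory can also be overridden per call:
ds = load_dataset("path/to/script.py", cache_dir="/path/to/custom_cache_dir")
```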

Steps to reproduce the bug

What I've tried:

Expected results

Datasets should load and cache as usual, with the only exception that the cache directory is different.

Actual results

Any of the actions taken above to change the cache directory results in the dataset loading indefinitely, without terminating.

Environment info

lhoestq commented 2 years ago

Hi! I wasn't able to reproduce the issue. When you kill the process, is there a stack trace that shows at what point in the code Python is hanging?
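One way to capture such a traceback without killing the process is Python's built-in faulthandler registered on a signal; a minimal sketch (the script path is a placeholder):

```python
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process print the current
# Python stack of every thread to stderr without terminating it.
faulthandler.register(signal.SIGUSR1)

from datasets import load_dataset

ds = load_dataset("path/to/script.py")  # the call that hangs
```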

kelvinAI commented 2 years ago

Hi @lhoestq, I've traced the issue back to file locking. It's similar to this thread, which also involves a Lustre filesystem: https://github.com/huggingface/datasets/issues/329. In that case the user was able to add the -o flock option while mounting, which solved the problem.
However, in other cases such as mine, we do not have permission to change the mount options. I'm still trying to figure out a workaround. Any ideas on how we can use a mounted Lustre filesystem without the flock option?
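A quick way to check whether flock actually works on the mount is something like the following (the test path is a placeholder on the Lustre filesystem):

```python
import fcntl

# Try to take an exclusive, non-blocking flock on a file on the mount;
# an OSError here usually means the filesystem does not support flock.
test_path = "/path/on/lustre_fs/.flock_test"  # placeholder
with open(test_path, "w") as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("flock acquired: file locking works on this mount")
        fcntl.flock(f, fcntl.LOCK_UN)
    except OSError as e:
        print(f"flock failed: {e}")
```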

jpmcd commented 2 years ago

Hi @kelvinAI, I've had this issue on our institution's system, which uses Lustre (in addition to our compute nodes being siloed off from external network access). The workaround I found for downloading and loading datasets was to set the $HF_HOME environment variable to a location on the node's local storage (SSD), i.e. a location that gets cleared regularly and is commonly used for temporary or cached files, such as "scratch" storage. Maybe your sysadmins, if you have them, could point you to subdirectories on a node that aren't linked to the Lustre filesystem. After downloading to scratch I found that the cached transformers, modules, and metrics folders were fine to move to my user drives on the Lustre filesystem, but cached datasets that had fingerprints still had some issues with filelock, so it helps to use my_dataset.save_to_disk('path/on/lustre_fs') and the static class method Dataset.load_from_disk('path/on/lustre_fs'). In rough steps:

  1. Initially download to scratch storage with ds = datasets.load_dataset(dataset_name)
  2. Call ds.save_to_disk(my_path_on_lustre) with a path in your user space on the Lustre filesystem
  3. Load datasets with from datasets import Dataset; new_ds = Dataset.load_from_disk(my_path_on_lustre)

Obviously this hinges on there existing scratch storage on the nodes you're using. Fingers crossed.
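In code, those three steps look roughly like this; the dataset name and paths are placeholders, and the module-level load_from_disk is used so it also works when load_dataset returns a DatasetDict:

```python
from datasets import load_dataset, load_from_disk

# 1. Download and prepare into node-local scratch storage (not on Lustre).
ds = load_dataset("dataset_name", cache_dir="/scratch/hf_cache")

# 2. Persist the prepared Arrow files onto the Lustre filesystem.
ds.save_to_disk("/path/on/lustre_fs/my_dataset")

# 3. Later, reload directly from Lustre without going through the cache directory.
new_ds = load_from_disk("/path/on/lustre_fs/my_dataset")
```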

kelvinAI commented 2 years ago

Hi @jpmcd, thanks for sharing your experience. In my case, the Lustre filesystem (with more storage space) is the scratch storage like the one you mentioned. We have local storage for each user, but unfortunately there's not enough space on it to cache huge datasets, which is why I tried changing HF_HOME to point to the scratch disk with more space and ran into the flock issue. Unfortunately I'm not aware of any viable solution for now, so I simply fall back to using a torch dataset.
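That fallback can be as simple as a plain PyTorch dataset over the JSON files; a minimal sketch, assuming one JSON object per line and placeholder paths:

```python
import json
from torch.utils.data import Dataset, DataLoader

class JsonLinesDataset(Dataset):
    """Minimal stand-in for the HF dataset: one JSON object per line."""

    def __init__(self, path):
        with open(path) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]

# Placeholder path; no Hugging Face cache (and thus no flock) is involved.
ds = JsonLinesDataset("/path/on/lustre_fs/data.jsonl")
loader = DataLoader(ds, batch_size=8)
```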

minaremeli commented 1 year ago

@jpmcd your comment saved me from pulling my hair out in frustration. Setting HF_HOME to a directory that's not on Lustre works like a charm. ✨