Open kelvinAI opened 2 years ago
Hi! I didn't manage to reproduce the issue. When you kill the process, is there any stack trace that shows at what point in the code Python is hanging?
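One low-tech way to get such a trace, if you want to try it, is to register a faulthandler signal handler before loading and then send that signal to the stuck process from another shell (a sketch; the dataset name is a placeholder):

```python
import faulthandler
import signal

# Dump all thread stacks to stderr when the process receives SIGUSR1,
# e.g. run `kill -USR1 <pid>` from another shell while the load hangs.
faulthandler.register(signal.SIGUSR1)

from datasets import load_dataset

ds = load_dataset("my_dataset")  # placeholder dataset name
```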
Hi @lhoestq , I've traced the issue back to file locking. It's similar to this thread, which also involves a Lustre filesystem: https://github.com/huggingface/datasets/issues/329 . In that case the user was able to add the -o flock option while mounting, and it solved the problem.
However, in other cases such as mine, we don't have permission to change the mount options. I'm still trying to figure out a workaround. Any ideas on how we can use a Lustre filesystem mounted without the flock option?
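One thing worth checking first is whether flock calls work at all on the target directory, e.g. with a small standalone script like this (a sketch; the path is a placeholder, and on a mount without flock support the call may raise an error or hang):

```python
import fcntl
import os

cache_dir = "/path/on/lustre_fs"  # placeholder: prospective cache location
test_path = os.path.join(cache_dir, ".flock_test")

with open(test_path, "w") as f:
    # Non-blocking exclusive lock: fails (or hangs, depending on the Lustre
    # configuration) if the filesystem does not support flock.
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("flock acquired - this location should be usable for the HF cache")
    fcntl.flock(f, fcntl.LOCK_UN)
```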
Hi @kelvinAI , I've had this issue on our institution's system, which uses Lustre (in addition to our compute nodes being siloed off from external network access). My workaround for downloading/loading datasets was to set the $HF_HOME environment variable to a location on the node's local storage (SSD), i.e. the kind of regularly cleared storage commonly used for temporary or cached files, e.g. "scratch" storage. Maybe your sysadmins, if you have them, could point you to subdirectories on a node that aren't on the Lustre filesystem. After downloading to scratch I found that the cached transformers, modules, and metrics folders were fine to move to my user drives on the Lustre filesystem, but cached datasets with fingerprints still had some filelock issues, so it helps to use my_dataset.save_to_disk('path/on/lustre_fs') and the static class method Dataset.load_from_disk('path/on/lustre_fs'). In rough steps:
1. ds = datasets.load_dataset(dataset_name)
2. ds.save_to_disk(my_path_on_lustre), with a path in your user space on the Lustre filesystem
3. from datasets import Dataset; new_ds = Dataset.load_from_disk(my_path_on_lustre)
Obviously this hinges on there existing scratch storage on the nodes you're using. Fingers crossed.
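Putting it together, the whole workflow looks roughly like this (a sketch; the paths, dataset name, and split are placeholders):

```python
import os

# Point the Hugging Face cache at node-local scratch storage *before*
# importing datasets (placeholder path).
os.environ["HF_HOME"] = "/local/scratch/hf_home"

import datasets
from datasets import Dataset

# Download/prepare on local scratch, where flock works.
ds = datasets.load_dataset("some_dataset", split="train")

# Persist the prepared dataset to your user space on Lustre.
ds.save_to_disk("/path/on/lustre_fs/some_dataset")

# Later (e.g. in a training job), reload it without touching the scratch cache.
new_ds = Dataset.load_from_disk("/path/on/lustre_fs/some_dataset")
```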
Hi @jpmcd , thanks for sharing your experience. In my case, the Lustre filesystem (with more storage space) is the scratch storage, like the one you've mentioned. We have local storage for each user, but unfortunately there's not enough space in it to cache huge datasets, which is why I tried changing HF_HOME to point to the scratch disk with more space and ran into the flock issue. Unfortunately I'm not aware of any viable solution for now, so I simply fall back to using a torch dataset.
@jpmcd your comment saved me from pulling my hair out in frustration. Setting HF_HOME
to a directory that's not on Lustre works like a charm. ✨
Describe the bug
The dataset loads indefinitely after modifying the cache path (~/.cache/huggingface). If none of the environment variables are set, this custom dataset loads fine (a JSON-based dataset with a custom dataset load script). ** Update: Transformers modules face the same issue during loading.
Issue:
No error or logs are thrown, so I'm out of ideas for how to debug this. The custom dataset works fine if the default ~/.cache dir is used, but unfortunately it's out of space and we do not have permission to modify the disk.
Steps to reproduce the bug
What I've tried:
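In general, redirecting the Hugging Face cache boils down to one of the following documented mechanisms (a sketch with placeholder paths and dataset name, not an exact record of the commands used here):

```python
# 1) Environment variables, set before Python / `datasets` start, e.g. in the job script:
#      export HF_HOME=/mnt/lustre/scratch/hf_home
#      export HF_DATASETS_CACHE=/mnt/lustre/scratch/hf_datasets_cache
#
# 2) Per call, via the cache_dir argument:
from datasets import load_dataset

ds = load_dataset("my_dataset", cache_dir="/mnt/lustre/scratch/hf_datasets_cache")
```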
Expected results
Datasets should load / cache as usual with the only exception that cache directory is different
Actual results
Any of the actions taken above to change the cache directory result in loading indefinitely, without terminating.
Environment info
transformers version: 4.18.0.dev0