I'm facing an issue when calling .compute in a distributed multi-node setting.
The symptoms are the same as in huggingface/datasets#4420. However, I'm not sure the cause is the same: the code modification suggested there did not solve the issue, although I didn't try lockf.
Environment
evaluate v0.4.0
PyTorch v2.0.1
2 nodes of 8 GPUs each (16 processes), with shared file system
script called with torchrun from SLURM
I get the following errors:
For processes on the first node (0-7, here 4 is reported):
[node1:4]:ValueError: Couldn't acquire lock on /.hf_cache/metrics/accuracy/default/exp-16-rdv.lock from process 4.
For processes on the second node (8-15, here 15 is reported):
[node2:7]:ValueError: Expected to find locked file /.hf_cache/metrics/accuracy/default/exp-16-0.arrow.lock from process 15 but it doesn't exist.
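To check what each node actually sees, a small helper can list the lock and arrow files and report which ones are visible. This is only a sketch: the naming pattern (`<experiment_id>-<num_process>-rdv.lock` and `<experiment_id>-<num_process>-<process_id>.arrow.lock`) is inferred from the two error messages above, not taken from documented API, and the function names are hypothetical.

```python
import os


def expected_cache_files(cache_dir, experiment_id, num_process):
    """Build the file paths the error messages suggest evaluate looks for.

    Assumption: the naming pattern is inferred from the errors above.
    """
    prefix = f"{experiment_id}-{num_process}"
    names = [f"{prefix}-rdv.lock"]  # rendezvous lock shared by all processes
    for process_id in range(num_process):
        names.append(f"{prefix}-{process_id}.arrow")       # per-process data
        names.append(f"{prefix}-{process_id}.arrow.lock")  # per-process lock
    return [os.path.join(cache_dir, name) for name in names]


def visible_files(cache_dir, experiment_id, num_process):
    """Map each expected path to whether it exists from this node's view."""
    paths = expected_cache_files(cache_dir, experiment_id, num_process)
    return {path: os.path.exists(path) for path in paths}
```

Running `visible_files(...)` on both nodes at the same moment and comparing the results may show whether the shared file system exposes the files consistently to both nodes.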
All metrics are loaded with the same experiment_id, and with the correct num_process and process_id arguments.
And of course all the files are present in the cache directory.
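For reference, the loading described above looks roughly like the following sketch. It assumes that the global rank and world size are read from the RANK and WORLD_SIZE environment variables set by torchrun. The helper names and the "exp" experiment id are illustrative, not the exact code from my script.

```python
import os


def metric_load_kwargs(experiment_id, env=None):
    """Derive the distributed arguments for evaluate.load from torchrun's env.

    Assumption: RANK is the global rank (0-15 here) and WORLD_SIZE is the
    total number of processes (16 here), as exported by torchrun.
    """
    env = os.environ if env is None else env
    return {
        "process_id": int(env.get("RANK", 0)),
        "num_process": int(env.get("WORLD_SIZE", 1)),
        "experiment_id": experiment_id,  # must be identical in all processes
    }


def load_accuracy(experiment_id, cache_dir=None):
    """Load the accuracy metric with the distributed arguments filled in."""
    import evaluate  # imported lazily so the helper above has no dependency

    return evaluate.load(
        "accuracy",
        cache_dir=cache_dir,
        **metric_load_kwargs(experiment_id),
    )
```

Each process then calls something like `metric = load_accuracy("exp")`, adds its batches with `metric.add_batch(...)`, and finally calls `metric.compute()`, which is where the errors above are raised.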
I really don't know how to further debug or solve this bug. Do you have any clue?
It works properly in a distributed single-node setting.
The lock mechanism is based on the filelock package. Right now evaluate requires a filesystem compatible with filelock. But feel free to ask filelock authors and community if there's a way to make it work on your filesystem.
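As a rough way to probe a filesystem, the sketch below tries both flock and lockf (the alternative mentioned in huggingface/datasets#4420) on a file in a given directory, using only the standard library. It assumes a Unix system, where filelock relies on fcntl.flock as far as I know. The function name and the `hold_seconds` parameter are hypothetical, and a successful result on one node does not prove that locks are honored across nodes.

```python
import fcntl
import os
import time


def probe_lock(directory, hold_seconds=0.0, name="probe.lock"):
    """Try a non-blocking exclusive lock with flock and lockf in `directory`.

    Returns a dict such as {"flock": True, "lockf": True}. False means the
    call failed, either because the lock is held elsewhere or because the
    filesystem does not support that kind of lock.
    """
    path = os.path.join(directory, name)
    results = {}
    for label, lock_fn in (("flock", fcntl.flock), ("lockf", fcntl.lockf)):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            lock_fn(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # acquire or fail fast
            time.sleep(hold_seconds)                    # keep it held a while
            lock_fn(fd, fcntl.LOCK_UN)                  # release
            results[label] = True
        except OSError:
            results[label] = False
        finally:
            os.close(fd)
    return results
```

To test cross-node behavior, one option is to run `probe_lock(cache_dir, hold_seconds=30)` on the first node and, a few seconds later, `probe_lock(cache_dir)` on the second. If the second call still reports True for flock while the first is holding the lock, the lock is probably not shared between nodes on that filesystem.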
Hi, does anyone have a working workaround for this issue? It still seems to happen in a multi-node distributed setting in spite of the latest releases and the workarounds mentioned here :(
cc @lhoestq