Open 2catycm opened 1 year ago
"Please specify an experiment_id to avoid collision between distributed evaluation module instances." I can give unique id to these experiements. But what is the way in evaluate to give "experiment_id " to metric instance. Here is a guide on this "experiment_id" api: The documentation has the way to specify the id. doc
But this issue was not resolved by that solution alone: because an exp_id is used, it seems you somehow have to create a corresponding folder under the HF cache first.
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/researches/hugging_face/hugging_face/src/train_phase2.py", line 239, in <module>
train(
File "/home/researches/hugging_face/hugging_face/src/commons.py", line 137, in train
train_results = trainer.train()
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/accelerate/utils/memory.py", line 136, in decorator
return function(batch_size, *args, **kwargs)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2226, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2934, in evaluate
output = eval_loop(
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 3222, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/home/researches/hugging_face/hugging_face/src/commons.py", line 109, in compute_metrics
return metric.compute(predictions=predictions, references=eval_pred.label_ids)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 432, in compute
self.add_batch(**inputs)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 481, in add_batch
self._init_writer()
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 605, in _init_writer
cache_file_name, filelock = self._create_cache_file() # get ready
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 268, in _create_cache_file
filelock = FileLock(file_path + ".lock")
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 399, in __init__
max_filename_length = os.statvfs(os.path.dirname(lock_file)).f_namemax
FileNotFoundError: [Errno 2] No such file or directory: '/home/.cache/huggingface/metrics/accuracy/default/../runs/phase2/swin-base-patch4-window7-224-in22k-finetuned-cifar10-finetune-cifar100_coarse_exp1'
Exception ignored in: <function BaseFileLock.__del__ at 0x14f52e4be290>
Traceback (most recent call last):
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 328, in __del__
self.release(force=True)
File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 304, in release
with self._thread_lock:
AttributeError: 'UnixFileLock' object has no attribute '_thread_lock'
0%| | 5/5000 [00:12<3:22:03, 2.43s/it]
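As a workaround for this particular FileNotFoundError, pre-creating the directory the file lock wants to live in lets the metric cache be written. The path is copied from the traceback above, and this assumes the missing folder is the only problem:

```python
import os

# Path copied from the FileNotFoundError above; adjust to your own run.
# This only creates the folder the lock file needs; it does not explain why
# the run name ends up inside the metrics cache path in the first place.
cache_dir = (
    "/home/.cache/huggingface/metrics/accuracy/default/../runs/phase2/"
    "swin-base-patch4-window7-224-in22k-finetuned-cifar10-finetune-cifar100_coarse_exp1"
)
os.makedirs(cache_dir, exist_ok=True)
```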
Same here.
To solve this problem, we should first locate where it originates. To do that, I made a minimal reproducible example, test_metric_concurrent.py:
import random
import multiprocessing

import evaluate

#%%
def test(exp_id=0):
    # Each process loads its own metric instance; the commented kwargs below
    # correspond to the four settings discussed underneath.
    metric = evaluate.load(
        "../pretrains/accuracy",
        # num_process=1,
        # experiment_id=f"oh-my-exp{exp_id}",  # setting one: uncomment this line
        # experiment_id=f"oh-my-exp{0}",       # setting two: uncomment this line
        # process_id=exp_id,                   # setting three: uncomment this line
    )  # setting four: leave all of the above commented out
    expect = random.random()
    num = 1000
    one_num = int(num * expect)
    times = 100
    for i in range(times):
        acc = metric.compute(predictions=[1] * num, references=[1] * one_num + [0] * (num - one_num))
        print(f"acc_{exp_id} = {acc}")

# %%
if __name__ == "__main__":
    num_p = 20
    processes = []
    for i in range(num_p):
        p = multiprocessing.Process(target=test, args=(i,))
        p.start()
        processes.append(p)
    for p in processes:  # join every started process, not only the last one
        p.join()
Unfortunately, we find that only setting three fails to run; all other settings run successfully without the bug, which is very confusing.
From my expectation, setting four should be the one that fails, because it is exactly the configuration of the bigger code I am running where I hit this issue.
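For the real runs, the workaround I can think of is to give every process an experiment_id that is unique and contains no path separators, e.g. derived from the PID. This is just a sketch, not something taken from the evaluate docs:

```python
import os
import uuid

import evaluate

# Sketch: an id that is unique per process and contains no "/" characters,
# so the metric's cache/lock file names cannot collide across processes
# or point outside the cache directory.
unique_id = f"exp-{os.getpid()}-{uuid.uuid4().hex[:8]}"
metric = evaluate.load("accuracy", experiment_id=unique_id)
```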
I have a problem similar to #481. I am running 10 independent experiments together on different GPUs but on the same file system, scheduled by LSF (similar to SLURM). metric.compute then raises exceptions because the experiment processes, which share the same code but use different parameters, conflict with each other. The code where I use evaluate is like:
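Roughly the following (reconstructed from compute_metrics in the traceback above; the argmax over the logits is my assumption about the surrounding lines):

```python
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # commons.py line 109 in the traceback above; the argmax step is assumed.
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
```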
The error is as shown in the traceback above.
The problem is in this code: evaluate/module.py, line 392.