huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

Concurrent Programs Sharing Evaluate Lib Would raise "ValueError: Error in finalize: another evaluation module instance is already using the local cache file." #489

Open 2catycm opened 1 year ago

2catycm commented 1 year ago

I have a problem similar to #481. I am running 10 independent experiments concurrently on different GPUs but on the same file system, scheduled by LSF (similar to SLURM). metric.compute then raises exceptions because the experiment processes share the same code (with different parameters) and collide on the same cache file.

The code where I use evaluate looks like this:

import os
import numpy as np
import evaluate

exp_id = os.environ.get("exp_id") or 0  # I can have this, but I don't know how to pass it to the metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

The error is as follows:

0%|          | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/researches/hugging_face/hugging_face/src/train_phase2.py", line 239, in <module>
    train(
  File "/home/researches/hugging_face/hugging_face/src/commons.py", line 137, in train
    train_results = trainer.train()
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/accelerate/utils/memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2226, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2934, in evaluate
    output = eval_loop(
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 3222, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/researches/hugging_face/hugging_face/src/commons.py", line 109, in compute_metrics
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 433, in compute
    self._finalize()
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 392, in _finalize
    raise ValueError(
ValueError: Error in finalize: another evaluation module instance is already using the local cache file. Please specify an experiment_id to avoid collision between distributed evaluation module instances.
 28%|██▊       | 843/3000 [1:05:13<2:46:54,  4.64s/it]

The problem comes from this code in evaluate/module.py, around line 392:

        elif self.process_id == 0:
            # Let's acquire a lock on each node files to be sure they are finished writing
            file_paths, filelocks = self._get_all_cache_files()

            # Read the predictions and references
            try:
                reader = ArrowReader(path="", info=DatasetInfo(features=self.selected_feature_format))
                self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
            except FileNotFoundError:
                raise ValueError(
                    "Error in finalize: another evaluation module instance is already using the local cache file. "
                    "Please specify an experiment_id to avoid collision between distributed evaluation module instances."
                ) from None

            # Store file paths and locks and we will release/delete them after the computation.
            self.file_paths = file_paths
            self.filelocks = filelocks
2catycm commented 1 year ago

"Please specify an experiment_id to avoid collision between distributed evaluation module instances." I can give unique id to these experiements. But what is the way in evaluate to give "experiment_id " to metric instance. Here is a guide on this "experiment_id" api: The documentation has the way to specify the id. doc

2catycm commented 1 year ago

But the issue did not end with that solution. It seems that when an exp_id is used, a corresponding folder has to be created under the HF cache somehow; a possible workaround is sketched after the traceback below.

  0%|          | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/researches/hugging_face/hugging_face/src/train_phase2.py", line 239, in <module>
    train(
  File "/home/researches/hugging_face/hugging_face/src/commons.py", line 137, in train
    train_results = trainer.train()
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/accelerate/utils/memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2226, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 2934, in evaluate
    output = eval_loop(
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/transformers/trainer.py", line 3222, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/researches/hugging_face/hugging_face/src/commons.py", line 109, in compute_metrics
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 432, in compute
    self.add_batch(**inputs)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 481, in add_batch
    self._init_writer()
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 605, in _init_writer
    cache_file_name, filelock = self._create_cache_file()  # get ready
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/evaluate/module.py", line 268, in _create_cache_file
    filelock = FileLock(file_path + ".lock")
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 399, in __init__
    max_filename_length = os.statvfs(os.path.dirname(lock_file)).f_namemax
FileNotFoundError: [Errno 2] No such file or directory: '/home/.cache/huggingface/metrics/accuracy/default/../runs/phase2/swin-base-patch4-window7-224-in22k-finetuned-cifar10-finetune-cifar100_coarse_exp1'
Exception ignored in: <function BaseFileLock.__del__ at 0x14f52e4be290>
Traceback (most recent call last):
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 328, in __del__
    self.release(force=True)
  File "/home/miniconda3/envs/hf_ai/lib/python3.10/site-packages/datasets/utils/filelock.py", line 304, in release
    with self._thread_lock:
AttributeError: 'UnixFileLock' object has no attribute '_thread_lock'

  0%|          | 5/5000 [00:12<3:22:03,  2.43s/it]
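
If the reading above is right and the directory that the path-like experiment_id resolves to simply does not exist yet, a minimal workaround sketch (the id string is illustrative; the path is the one from the traceback) is either to keep the experiment_id free of path separators or to create the missing directory before loading the metric:

import os
import evaluate

# Variant A: a plain experiment_id (no "/" or ".."), so the cache and lock
# files stay inside the existing default metrics cache directory.
metric = evaluate.load("accuracy", experiment_id="phase2_cifar100_exp1")

# Variant B: if a path-like experiment_id is kept, make sure the directory it
# resolves to exists before FileLock tries to stat it (path taken from the
# traceback above).
run_dir = ("/home/.cache/huggingface/metrics/accuracy/default/../runs/phase2/"
           "swin-base-patch4-window7-224-in22k-finetuned-cifar10-finetune-cifar100_coarse_exp1")
os.makedirs(run_dir, exist_ok=True)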
Yupei-Du commented 1 year ago

Same here.

2catycm commented 1 year ago


To solve this problem, we should locate where it originates, so we can write a minimal reproducible example. I made this test_metric_concurrent.py:

import random
import multiprocessing

import numpy as np
import evaluate

# %%
def test(exp_id=0):
    metric = evaluate.load("../pretrains/accuracy",
                           # num_process=1,
                           # experiment_id=f"oh-my-exp{exp_id}",  # setting one: uncomment this line
                           # experiment_id=f"oh-my-exp{0}",       # setting two: uncomment this line
                           # process_id=exp_id,                   # setting three: uncomment this line
                           )                                      # setting four: comment out all of the above
    expect = random.random()
    num = 1000
    one_num = int(num * expect)
    times = 100
    for i in range(times):
        acc = metric.compute(predictions=[1] * num, references=[1] * one_num + [0] * (num - one_num))
    print(f"acc_{exp_id} = {acc}")

# %%
num_p = 20
processes = []
for i in range(num_p):
    p = multiprocessing.Process(target=test, args=(i,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
2catycm commented 1 year ago

Unfortunately, we find that only setting three fails to run; all the other settings run successfully without the bug, which is very confusing.

2catycm commented 1 year ago

From my expectation, setting four should fail, because it is exactly the configuration I use in the bigger code where I encounter this issue.
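
For the bigger training code, the workaround that the error message itself points at would look roughly like the sketch below, reusing the exp_id environment variable from the first comment (the "lsf_job_" prefix is arbitrary):

import os
import numpy as np
import evaluate

# Give every concurrent LSF job its own experiment_id, derived from the per-job
# environment variable, so jobs sharing the file system no longer collide on
# the same metrics cache file.
exp_id = os.environ.get("exp_id") or 0
metric = evaluate.load("accuracy", experiment_id=f"lsf_job_{exp_id}")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)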