unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
8.1k stars 881 forks

[BUG] semaphore or lock released too many times #1939

Open jacktang opened 1 year ago

jacktang commented 1 year ago

Describe the bug

I am learning darts and Optuna hyperparameter optimization from the guide: https://unit8co.github.io/darts/userguide/hyperparameter_optimization.html#hyperparameter-optimization-with-optuna. I trained the model using the GPU and 4 dataloader workers, and got the following error:
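
For context, the setup looks roughly like the condensed sketch below (the dataset, split, and parameter values are placeholders rather than the exact values from the run; the full code is in the linked guide):

```python
# Condensed sketch of the setup (placeholder values; the full code is in the linked guide).
from darts.datasets import AirPassengersDataset
from darts.dataprocessing.transformers import Scaler
from darts.models import TCNModel

# Load and scale a small example series, then split into train/validation.
series = Scaler().fit_transform(AirPassengersDataset().load())
train, val = series.split_after(0.8)

model = TCNModel(
    input_chunk_length=12,
    output_chunk_length=6,
    n_epochs=100,
    pl_trainer_kwargs={"accelerator": "gpu", "devices": [0]},  # train on the GPU
)

# num_loader_workers=4 spawns 4 DataLoader worker processes -- the "4 workers" above.
model.fit(series=train, val_series=val, num_loader_workers=4)
```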

Metric val_loss improved by 0.010 >= min_delta = 0.001. New best score: 0.651
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [02:20<00:00, 46.78s/it, train_loss=1.830]
Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 271, in _feed
    queue_sem.release()
ValueError: semaphore or lock released too many times

Exception ignored in: <function _ConnectionBase.__del__ at 0x7fc62620c820>
Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 132, in __del__
    self._close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
GPU available: True (cuda), used: True

To Reproduce

The code is from https://unit8co.github.io/darts/userguide/hyperparameter_optimization.html#hyperparameter-optimization-with-optuna
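
For reference, the driver has the usual Optuna shape, sketched below. The hyperparameter names match the trial parameters printed in the logs, but the ranges are illustrative, and `build_and_fit_model` / `score_on_validation` are hypothetical helpers standing in for the guide's model-building and evaluation code:

```python
import optuna
from optuna.integration import PyTorchLightningPruningCallback  # prunes unpromising trials

def objective(trial):
    # Illustrative search space; parameter names match the trial output in the logs.
    kernel_size = trial.suggest_int("kernel_size", 2, 5)
    num_filters = trial.suggest_int("num_filters", 1, 5)
    weight_norm = trial.suggest_categorical("weight_norm", [False, True])
    dilation_base = trial.suggest_int("dilation_base", 2, 4)
    dropout = trial.suggest_float("dropout", 0.0, 0.4)
    lr = trial.suggest_float("lr", 5e-5, 1e-3, log=True)
    use_year = trial.suggest_categorical("year", [False, True])

    pruner = PyTorchLightningPruningCallback(trial, monitor="val_loss")

    # Hypothetical helpers wrapping the TCNModel construction, the GPU / 4-worker
    # fit call shown earlier, and validation scoring.
    model = build_and_fit_model(
        kernel_size, num_filters, weight_norm, dilation_base,
        dropout, lr, use_year, callbacks=[pruner],
    )
    return score_on_validation(model)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
```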

Expected behavior

No error.

System (please complete the following information):

madtoinou commented 1 year ago

Hi @jacktang, thank you for writing.

Can you please indicate which cell of the notebook raises the error? It seems like it comes from one of darts' dependencies...

Also, can you try upgrading to darts 0.25.0?

jacktang commented 1 year ago

OK. I upgraded to 0.25.0 and converted the notebook code to a Python script, but the error still exists. The OS is Ubuntu 20.04.4 LTS.

Best value: 29.46530282497406, Best params: {'kernel_size': 3, 'num_filters': 4, 'weight_norm': False, 'dilation_base': 2, 'dropout': 0.017801282281381472, 'lr': 8.169771024932909e-05, 'year': False}
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type              | Params
----------------------------------------------------
0 | criterion     | MSELoss           | 0
1 | train_metrics | MetricCollection  | 0
2 | val_metrics   | MetricCollection  | 0
3 | dropout       | MonteCarloDropout | 0
4 | res_blocks    | ModuleList        | 166
----------------------------------------------------
166       Trainable params
0         Non-trainable params
166       Total params
0.001     Total estimated model params size (MB)
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 42.49it/s, train_loss=8.210]
[I 2023-08-08 17:42:53,616] Trial 16 pruned. Trial was pruned at epoch 0.
Current value: 5.6125102043151855, Current params: {'kernel_size': 3, 'num_filters': 3, 'weight_norm': False, 'dilation_base': 2, 'dropout': 0.10989051943366332, 'lr': 0.0008949513735868809, 'year': False}
Best value: 29.46530282497406, Best params: {'kernel_size': 3, 'num_filters': 4, 'weight_norm': False, 'dilation_base': 2, 'dropout': 0.017801282281381472, 'lr': 8.169771024932909e-05, 'year': False}
Epoch 1: 100%|██████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.10it/s, train_loss=1.000, val_loss=0.859]
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.07s/it, train_loss=8.210]
Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 271, in _feed
    queue_sem.release()
ValueError: semaphore or lock released too many times
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type              | Params
----------------------------------------------------
0 | criterion     | MSELoss           | 0
1 | train_metrics | MetricCollection  | 0
2 | val_metrics   | MetricCollection  | 0
3 | dropout       | MonteCarloDropout | 0
4 | res_blocks    | ModuleList        | 68
----------------------------------------------------
68        Trainable params
0         Non-trainable params
68        Total params
0.000     Total estimated model params size (MB)
Epoch 7: 100%|██████████████████████████████████████████████████| 3/3 [01:04<00:00, 21.54s/it, train_loss=0.794, val_loss=1.220]
Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dev/miniconda3/envs/pf/lib/python3.10/multiprocessing/queues.py", line 271, in _feed
    queue_sem.release()
ValueError: semaphore or lock released too many times
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [01:32<00:00, 30.92s/it, train_loss=1.360]
Epoch 7: 100%|██████████████████████████████████████████████████| 3/3 [01:15<00:00, 25.29s/it, train_loss=0.903, val_loss=0.999]
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 48.06it/s, train_loss=1.370]
[I 2023-08-08 17:44:09,206] Trial 17 pruned. Trial was pruned at epoch 0.
Current value: 1.354859471321106, Current params: {'kernel_size': 4, 'num_filters': 2, 'weight_norm': False, 'dilation_base': 3, 'dropout': 0.045057036646966524, 'lr': 7.765323102891736e-05, 'year': False}
Best value: 29.46530282497406, Best params: {'kernel_size': 3, 'num_filters': 4, 'weight_norm': False, 'dilation_base': 2, 'dropout': 0.017801282281381472, 'lr': 8.169771024932909e-05, 'year': False}
WinstonPrivacy commented 1 month ago

I'm encountering a similar error, but it appears to only occur when debugging in PyCharm. There are no issues when training on the CPU.
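
In case it helps with narrowing this down, switching to CPU training and single-process data loading in darts looks roughly like this (a sketch only; `train`/`val` and the model parameters are placeholders, and this is a debugging aid rather than a confirmed fix):

```python
from darts.models import TCNModel

# Sketch: force CPU training and disable DataLoader worker processes while debugging.
# `train` and `val` are placeholders for your own TimeSeries objects.
model = TCNModel(
    input_chunk_length=12,
    output_chunk_length=6,
    pl_trainer_kwargs={"accelerator": "cpu"},  # skip the GPU path
)
model.fit(series=train, val_series=val, num_loader_workers=0)  # no extra worker processes
```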