optuna / optuna-integration

Extended functionalities for Optuna in combination with third-party libraries.
https://optuna-integration.readthedocs.io/en/latest/index.html
MIT License

PyTorchLightningPruningCallback messes with Multiworker Dataloaders #154

Open mspils opened 1 year ago

mspils commented 1 year ago

Expected behavior

When using the PyTorchLightningPruningCallback, a pruned trial should resolve without errors.
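
For reference, here is a minimal, hypothetical sketch (not taken from this report) of how a pruned trial is expected to resolve: the pruning callback reports the monitored metric, raises `optuna.TrialPruned`, `study.optimize()` catches it, records the trial as `PRUNED`, and the next trial starts cleanly.

```python
import optuna


def objective(trial: optuna.trial.Trial) -> float:
    value = 1.0
    for epoch in range(10):
        # Stand-in for a validation metric; the Lightning callback would report
        # trainer.callback_metrics["val_loss"] here instead.
        value = trial.number * 0.1 + 1.0 / (epoch + 1)
        trial.report(value, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()  # expected to end the trial without errors
    return value


study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10)  # pruned trials are logged as "Trial N pruned."
```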

Environment

Error messages, stack traces, or logs

[I 2023-10-11 17:40:48,579] Trial 5 finished with value: 0.08356545865535736 and parameters: {'learning_rate': 0.0020429196484991327, 'n_layers': 1, 'n_units_l0': 4}. Best is trial 4 with value: 0.078277587890625.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 49    
-------------------------------------
49        Trainable params
0         Non-trainable params
49        Total params
0.000     Total estimated model params size (MB)
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 237.30it/s, v_num=61[I 2023-10-11 17:40:49,436] Trial 6 pruned. Trial was pruned at epoch 1.███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1104.35it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 16    
-------------------------------------
16        Trainable params
0         Non-trainable params
16        Total params
0.000     Total estimated model params size (MB)
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 205.02it/s, v_num=62[I 2023-10-11 17:40:50,429] Trial 7 pruned. Trial was pruned at epoch 1.████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 965.54it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 13    
-------------------------------------
13        Trainable params
0         Non-trainable params
13        Total params
0.000     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 20.64it/s, v_num=62]
Epoch 0:   0%|                                                                                                                                                                                                                                                     | 0/10 [00:00<?, ?it/s]Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>                                                                                                                                                                                                 
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.75it/s, v_num=62]
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.75it/s, v_num=62]
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.74it/s, v_num=62]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
                                                                                                                                                                                                                                                                                           File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__                                                                                                                                                              
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.73it/s, v_num=62]
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.71it/s, v_num=62]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.69it/s, v_num=62]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fb7eb82d820>
    if w.is_alive():
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1478, in __del__
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
    self._shutdown_workers()
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1461, in _shutdown_workers
    if w.is_alive():
  File "/home/mspils/miniconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.67it/s, v_num=62]
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11.66it/s, v_num=62]
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 214.93it/s, v_num=63[I 2023-10-11 17:40:51,374] Trial 8 pruned. Trial was pruned at epoch 1.████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 576.38it/s]
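
For context, the `AssertionError: can only test a child process` lines above come from `multiprocessing.Process.is_alive()`, which asserts that it is only called from the process that started the worker. A minimal, hypothetical sketch of that mechanism on Linux (illustrative only, not taken from this report; the DataLoader analogue is an iterator created in one process being cleaned up in another):

```python
import multiprocessing as mp
import os
import time


def worker() -> None:
    time.sleep(5)


if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()

    pid = os.fork()               # another process inherits the Process handle
    if pid == 0:
        try:
            p.is_alive()          # polling a handle this process did not start ...
        except AssertionError as e:
            print("in fork:", e)  # ... raises: can only test a child process
        os._exit(0)

    os.waitpid(pid, 0)
    p.terminate()
    p.join()
```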

Steps to reproduce

  1. Run the following code (you may need to change DEVICES and ACCELERATOR if you do not have multiple GPUs).
  2. Wait until a trial is pruned.
    
```python
from typing import List, Optional

# import pytorch_lightning as pl
import lightning.pytorch as pl
import optuna
import torch
from lightning.pytorch.callbacks import Callback
from optuna.integration import PyTorchLightningPruningCallback
from torch import nn, optim
from torch.utils.data import DataLoader

torch.set_float32_matmul_precision('high')
BATCHSIZE = 1024
EPOCHS = 50
ACCELERATOR = 'cuda'
DEVICES = [1]


class OptunaPruningCallback(PyTorchLightningPruningCallback, Callback):
    """Custom Optuna pruning callback, because CUDA/Lightning do not play well with the default one.

    Args:
        PyTorchLightningPruningCallback (_type_): _description_
        pl (_type_): _description_
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


class ToyDataSet(torch.utils.data.Dataset):
    def __init__(self, count):
        super().__init__()
        self.x = torch.rand(count, dtype=torch.float32)
        self.y = torch.rand(count, dtype=torch.float32)
        self.count = count

    def __len__(self) -> int:
        return self.count

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError(f"Index {idx} is out of range, dataset has length {len(self)}")
        return self.x[idx], self.y[idx]


class LightningNet(pl.LightningModule):
    def __init__(self, output_dims, learning_rate) -> None:
        super().__init__()
        layers = []
        input_dim = 1
        for output_dim in output_dims:
            layers.append(nn.Linear(input_dim, output_dim))
            layers.append(nn.ReLU())
            input_dim = output_dim
        layers.append(nn.Linear(input_dim, 1))

        self.model = nn.Sequential(*layers)
        self.save_hyperparameters()

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.model(data)

    def training_step(self, batch: List[torch.Tensor], batch_idx: int) -> torch.Tensor:
        x, y = batch
        x = x.view(-1, 1)
        y_hat = self(x)[:, 0]
        loss = nn.functional.mse_loss(y_hat, y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch: List[torch.Tensor], batch_idx: int) -> None:
        x, y = batch
        x = x.view(-1, 1)
        y_hat = self(x)[:, 0]
        val_loss = nn.functional.mse_loss(y_hat, y)
        self.log("val_loss", val_loss, sync_dist=True)

    def configure_optimizers(self) -> optim.Optimizer:
        return optim.Adam(self.model.parameters(), self.hparams.learning_rate)

    def setup(self, stage: Optional[str] = None) -> None:
        self.dataset_train = ToyDataSet(10000)
        self.dataset_val = ToyDataSet(1000)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_train, batch_size=BATCHSIZE, shuffle=True,
                          pin_memory=True, num_workers=8, persistent_workers=True)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_val, batch_size=BATCHSIZE, shuffle=False,
                          pin_memory=True, num_workers=8, persistent_workers=True)


def objective(trial: optuna.trial.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 2)
    output_dims = [trial.suggest_int(f"n_units_l{i}", 4, 64, log=True) for i in range(n_layers)]

    model = LightningNet(output_dims, learning_rate)

    trainer = pl.Trainer(
        logger=True,
        enable_checkpointing=False,
        max_epochs=EPOCHS,
        accelerator=ACCELERATOR,
        devices=DEVICES,
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
        # callbacks=[OptunaPruningCallback(trial, monitor="val_loss")],
    )

    trainer.fit(model)

    return trainer.callback_metrics["val_loss"].item()


if __name__ == "__main__":
    study = optuna.create_study(
        direction="minimize",
        pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto',
                                              reduction_factor=3, bootstrap_count=0),
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=100)
```
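
The settings that seem to matter here are `num_workers=8` together with `persistent_workers=True` in both dataloaders: the worker processes outlive each epoch and are only torn down when the loader iterators are garbage-collected, which is presumably where the "Exception ignored in ... __del__" messages above come from once a trial is pruned.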


### Additional context (optional)

When optimizing a study with Optuna using the PyTorchLightningPruningCallback, pruned trials may not finish properly:
DataLoaders with multiple workers are not shut down correctly and may even interfere with later trials. At least, the logged v_nums are sometimes out of order.
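
One workaround (an assumption on my part, not something confirmed in this thread) is to give up persistent multi-process loading while tuning, e.g. by replacing the two dataloader methods in the script above with single-process variants; the search runs slower per trial, but a pruned trial then leaves no worker processes behind:

```python
    # Hypothetical workaround sketch: load data in the main process during the
    # Optuna search so pruned trials cannot leave DataLoader workers behind.
    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_train, batch_size=BATCHSIZE, shuffle=True,
                          pin_memory=True, num_workers=0)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.dataset_val, batch_size=BATCHSIZE, shuffle=False,
                          pin_memory=True, num_workers=0)
```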
HideakiImamura commented 1 year ago

@mspils Does this problem still occur with the latest Optuna v3.4?

mspils commented 12 months ago

Yes and no. It crashes, which is probably an improvement:


[W 2023-11-21 13:45:48,635] Trial 0 failed with parameters: {'learning_rate': 0.009733867742024538, 'n_layers': 1, 'n_units_l0': 12} because of the following error: RuntimeError('DataLoader worker (pid(s) 3999530) exited unexpectedly').
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3999530) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "optuna_issue.py", line 108, in objective
    trainer.fit(model)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1284, in _get_data
    success, data = self._try_get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3999530) exited unexpectedly
[W 2023-11-21 13:45:48,646] Trial 0 failed with value None.
Traceback (most recent call last):
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3999530) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "optuna_issue.py", line 119, in <module>
    study.optimize(objective, n_trials=100)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/study.py", line 451, in optimize
    _optimize(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 251, in _run_trial
    raise func_err
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "optuna_issue.py", line 108, in objective
    trainer.fit(model)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1284, in _get_data
    success, data = self._try_get_data()
  File "/home/mspils/miniconda3/envs/optuna_test3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3999530) exited unexpectedly
Epoch 0:   0%|          | 0/10 [00:00<?, ?it/s]                                         
youyinnn commented 10 months ago

Same issue here.