szc19990412 / TransMIL

TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Problem when Training with multiple GPUs #46

Open Brucezhuu opened 2 months ago

Brucezhuu commented 2 months ago

Dataset: Camelyon16.

Following CLAM's WSI processing pipeline, I started training with "python train.py --stage='train' --config='Camelyon/TransMIL.yaml' --gpus=0 --fold=0", and nothing went wrong. However, when I tried to train with 3 GPUs, changing nothing but the command ("python train.py --stage='train' --config='Camelyon/TransMIL.yaml' --gpus=0,1,2 --fold=0"), I got AttributeError: 'Lookahead' object has no attribute 'base_optimizer'.

The full error message is as follows:

Traceback (most recent call last):
  File "train.py", line 91, in <module>
    main(cfg)
  File "train.py", line 70, in main
    trainer.fit(model = model, datamodule = dm)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 514, in fit
    self.dispatch()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 554, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 106, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 159, in new_process
    results = trainer.train_or_test_or_predict()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 564, in train_or_test_or_predict
    results = self.run_train()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in run_train
    self.train_loop.run_training_epoch()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 655, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1384, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 279, in optimizer_step
    self.precision_plugin.post_optimizer_step(optimizer, opt_idx)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 88, in post_optimizer_step
    self.scaler.step(optimizer)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 333, in step
    retval = optimizer.step(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/data/TransMIL/MyOptimizer/lookahead.py", line 47, in step
    loss = self.base_optimizer.step(closure)
AttributeError: 'Lookahead' object has no attribute 'base_optimizer'

How can I solve it?
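
A possible explanation, offered as a guess rather than a confirmed diagnosis: the multi-GPU run goes through ddp_spawn.py (visible in the traceback), which starts workers with mp.spawn, and objects handed to spawned processes are pickled. torch.optim.Optimizer defines a __getstate__ that serializes only defaults, state, and param_groups, so if Lookahead subclasses it and stores base_optimizer as a plain attribute, that attribute would be lost in the workers. The sketch below reproduces this mechanism in pure Python. BaseOptimizer, Lookahead, PicklableLookahead, and survives_pickle are hypothetical stand-ins written for illustration, not the code in MyOptimizer/lookahead.py or in PyTorch.

```python
import pickle


class BaseOptimizer:
    """Hypothetical stand-in for torch.optim.Optimizer (not the real class).

    Like the real one, it pickles only a fixed set of attributes.
    """

    def __init__(self, param_groups):
        self.defaults = {}
        self.state = {}
        self.param_groups = param_groups

    def __getstate__(self):
        # Only these three keys are serialized; any other attribute is dropped.
        return {
            "defaults": self.defaults,
            "state": self.state,
            "param_groups": self.param_groups,
        }

    def __setstate__(self, state):
        self.__dict__.update(state)

    def step(self, closure=None):
        return None if closure is None else closure()


class Lookahead(BaseOptimizer):
    """Hypothetical wrapper that keeps the wrapped optimizer as a plain attribute."""

    def __init__(self, base_optimizer):
        self.base_optimizer = base_optimizer
        self.defaults = base_optimizer.defaults
        self.state = {}
        self.param_groups = base_optimizer.param_groups

    def step(self, closure=None):
        return self.base_optimizer.step(closure)


class PicklableLookahead(Lookahead):
    """Assumed workaround: include base_optimizer in the pickled state."""

    def __getstate__(self):
        state = super().__getstate__()
        state["base_optimizer"] = self.base_optimizer
        return state


def survives_pickle(optimizer):
    """Return True if base_optimizer is still present after a pickle round trip."""
    clone = pickle.loads(pickle.dumps(optimizer))
    return hasattr(clone, "base_optimizer")


inner = BaseOptimizer([{"params": [], "lr": 0.1}])
print(survives_pickle(Lookahead(inner)))           # False
print(survives_pickle(PicklableLookahead(inner)))  # True
```

If this is indeed the cause, extending the wrapper's pickled state along these lines, or using a launch mode that does not pickle the optimizer, might avoid the error. Neither has been verified against this repository.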