unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

Processes are terminated in multi-GPU setting when using multiple models and seeds #2519

Closed · KunzstBGR closed this issue 1 month ago

KunzstBGR commented 1 month ago

Hi, When comparing multiple models and multiple seeds using a nested loop, all processes are terminated when the loop switches from one model class to the next. Does anyone have an idea why? Maybe I'm doing this wrong. Or is this a pytorch-lightning issue?

Error message: Child process with PID 652 terminated with code 1. Forcefully terminating all other processes to avoid zombies

Relevant code snippet:

# ...
import gc
import os

import torch
from darts.models import NHiTSModel, TiDEModel, TFTModel
from pytorch_lightning import seed_everything
from pytorch_lightning.strategies import DDPStrategy

def create_params(input_chunk_length,
                  output_chunk_length, 
                  quantiles,
                  batch_size,
                  n_epochs,
                  dropout):

    # ...

    pl_trainer_kwargs = {
        # DDPStrategy only configures the distributed strategy; the accelerator
        # and device count are passed to the Trainer directly.
        'strategy': DDPStrategy(process_group_backend='gloo'),
        'accelerator': 'gpu',
        'devices': 4,
        # ...
    }
    # ...

def dl_model_training(df, 
                      seeds, 
                      input_chunk_length,
                      output_chunk_length, 
                      quantiles,
                      batch_size,
                      n_epochs,
                      dropout):

    # Some data processing ...

    for model_arch, model_class in [('NHiTS', NHiTSModel), ('TiDE', TiDEModel), ('TFT', TFTModel)]:
        for i in seeds:
            # Set the seed
            seed_everything(i, workers=True)

            # Define the model name with seed
            model_arch_seed = f'{model_arch}_gws_{i}'

            # Train the model
            model = model_class(
                **create_params(
                    input_chunk_length,
                    output_chunk_length,
                    quantiles,
                    batch_size,
                    n_epochs,
                    dropout
                ),
                model_name=model_arch_seed,
                work_dir=os.path.join(MODEL_PATH, model_arch)
            )

            # Fit the model
            model.fit(
                series=train_gws,
                past_covariates=train_cov,
                future_covariates=train_cov if model_arch in ['TFT', 'TiDE'] else None,
                val_series=val_gws,
                val_past_covariates=val_cov,
                val_future_covariates=val_cov if model_arch in ['TFT', 'TiDE'] else None,
                verbose=True
            )

            # Clean up to prevent memory issues
            del model
            gc.collect()
            torch.cuda.empty_cache()

if __name__ == '__main__':
    torch.multiprocessing.freeze_support()
    dl_model_training(df=gws_bb_subset,
                      seeds=seeds,
                      input_chunk_length=52,
                      output_chunk_length=16,
                      quantiles=None,
                      batch_size=4096,
                      n_epochs=10,
                      dropout=0.2)

madtoinou commented 1 month ago

Hi @KunzstBGR,

This issue seems to come from PyTorch Lightning rather than from Darts.

It might also be related to the multi-GPU setup. Can you check whether the error persists when you use devices=[0]?

Have you tried changing the num_nodes parameter used with DDP (based on the PyTorch Lightning docs)?
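
Something along these lines, just as a sketch (only the strategy/accelerator/devices entries matter here; everything else stays as in your create_params):

# Single-GPU sanity check: rule out a DDP-specific problem first.
pl_trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': [0],
}

# If multi-GPU is needed, num_nodes (number of machines) can be set explicitly
# alongside the DDP strategy; it defaults to 1.
pl_trainer_kwargs = {
    'strategy': DDPStrategy(process_group_backend='gloo'),
    'accelerator': 'gpu',
    'devices': 4,
    'num_nodes': 1,
}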

Also, is it intentional that you don't save checkpoints or generate any forecasts in your code snippet?

KunzstBGR commented 1 month ago

Hi @madtoinou, thanks for your quick response!

Swapping the order of the loops seems to have fixed it: with the seeds in the outer loop and the model classes in the inner loop, the processes are no longer terminated. I do save checkpoints and run the evaluation in a separate loop, so that part is just not shown in the snippet.

madtoinou commented 1 month ago

Nice! I wouldn't be able to tell you exactly why swapping the order of the loops fixed it, but as long as it works, that's great!
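
For other readers running into this, the swapped order would look roughly like the sketch below (reusing the variables from the snippet above; re-seeding before every model keeps the runs comparable):

# Sketch of the swapped loop order: seeds outside, model classes inside.
for i in seeds:
    for model_arch, model_class in [('NHiTS', NHiTSModel), ('TiDE', TiDEModel), ('TFT', TFTModel)]:
        seed_everything(i, workers=True)  # re-seed before every model
        model = model_class(
            **create_params(input_chunk_length, output_chunk_length,
                            quantiles, batch_size, n_epochs, dropout),
            model_name=f'{model_arch}_gws_{i}',
            work_dir=os.path.join(MODEL_PATH, model_arch),
        )
        model.fit(
            series=train_gws,
            past_covariates=train_cov,
            future_covariates=train_cov if model_arch in ['TFT', 'TiDE'] else None,
            val_series=val_gws,
            val_past_covariates=val_cov,
            val_future_covariates=val_cov if model_arch in ['TFT', 'TiDE'] else None,
            verbose=True,
        )
        del model
        gc.collect()
        torch.cuda.empty_cache()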

All good if you save the checkpoints and perform evaluation in a separate loop, I was just curious since it was not visible in the code snippet. It's indeed better to do it separately.
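
For completeness, the separate evaluation step could look something like this sketch, assuming the models were created with save_checkpoints=True (the forecast horizon n and the series passed to predict() below are placeholders):

from darts.models import NHiTSModel, TiDEModel, TFTModel

# Sketch: reload the best checkpoint of each trained model and forecast.
for model_arch, model_class in [('NHiTS', NHiTSModel), ('TiDE', TiDEModel), ('TFT', TFTModel)]:
    for i in seeds:
        model = model_class.load_from_checkpoint(
            model_name=f'{model_arch}_gws_{i}',
            work_dir=os.path.join(MODEL_PATH, model_arch),
            best=True,  # checkpoint with the best validation loss rather than the last one
        )
        preds = model.predict(
            n=16,  # placeholder horizon (here equal to output_chunk_length)
            series=train_gws,
            past_covariates=train_cov,
            future_covariates=train_cov if model_arch in ['TFT', 'TiDE'] else None,
        )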

If the issue is solved, can you please close it?