Hi @KunzstBGR,
This issue seems to come from PyTorch Lightning rather than from Darts.
It might also arise from the fact that you use multiple GPUs. Can you check if it persists when you use devices=[0]?
Have you tried changing the num_nodes parameter of DDP (based on the PyTorch doc)?
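Just a sketch of what the single-GPU check could look like; 'accelerator' and 'devices' are standard PyTorch Lightning Trainer arguments that Darts forwards through pl_trainer_kwargs, and the rest of your trainer kwargs would stay as in your snippet:
# Sketch: run on a single GPU to check whether the crash persists
pl_trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': [0],  # single GPU instead of devices=4
    # ... your other trainer kwargs ...
}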
Also, is it normal that you don't save checkpoints or generate any kind of forecasts in your code snippet?
Hi @madtoinou , thanks for your quick response!
Multi-GPU: It works with one GPU. After some testing I realized that it has to do with the nested loop. If I switch the order of the for loops and move the seed_everything statement up, the processes do not terminate during the switch from one model class to the next in the multi-GPU setting. Honestly, I can't quite wrap my head around why this works, but I'm glad it does:
for i in seeds:
    # Set the seed
    seed_everything(i, workers=True)
    for model_arch, model_class in [('TiDE', TiDEModel), ('NHiTS', NHiTSModel)]:
        ...
Nodes: If I set num_nodes higher than 1, the whole process gets stuck (I guess my GPUs are all on one node? I'm not too familiar with these things).
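A quick diagnostic like this (just a sketch) should tell whether all GPUs are attached to the same machine, in which case num_nodes=1 would be the right setting:
import torch

# Number of CUDA devices visible to this single machine/node;
# if this already equals 4, everything sits on one node and num_nodes=1 is correct
print(torch.cuda.device_count())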
Checkpoints: I enabled checkpointing, here's the full code for the model parameters:
import torch
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.strategies import DDPStrategy
from torchmetrics import (MetricCollection, MeanAbsoluteError,
                          MeanAbsolutePercentageError, MeanSquaredError)


def create_params(input_chunk_length,
                  output_chunk_length,
                  quantiles,
                  batch_size,
                  n_epochs,
                  dropout):
    # Add metrics for evaluation
    torch_metrics = MetricCollection(
        [MeanSquaredError(), MeanAbsoluteError(), MeanAbsolutePercentageError()]
    )

    # Early stopping
    early_stopper = EarlyStopping(
        monitor='val_loss',
        patience=10,
        min_delta=0.001,
        mode='min'
    )

    lr_scheduler_cls = torch.optim.lr_scheduler.ExponentialLR
    lr_scheduler_kwargs = {'gamma': 0.999}
    lr_logger = LearningRateMonitor(logging_interval='step')  # log the learning rate ('step' or 'epoch')

    pl_trainer_kwargs = {
        'strategy': DDPStrategy(process_group_backend='gloo'),
        'accelerator': 'gpu',  # passed at the trainer level rather than to DDPStrategy
        'devices': 4,
        'val_check_interval': 0.5,
        'log_every_n_steps': 10,
        'enable_model_summary': True,
        'enable_checkpointing': True,
        'callbacks': [early_stopper, lr_logger],
        'gradient_clip_val': 1,
        'num_nodes': 1}

    return {
        'input_chunk_length': input_chunk_length,    # lookback window
        'output_chunk_length': output_chunk_length,  # forecast/lookahead window
        'use_reversible_instance_norm': True,
        'pl_trainer_kwargs': pl_trainer_kwargs,
        'likelihood': None,
        'loss_fn': torch.nn.MSELoss(),
        'save_checkpoints': True,  # checkpoint to retrieve the best performing model state
        'force_reset': True,       # previously existing models with the same name will be reset (& checkpoints will be discarded)
        'batch_size': batch_size,
        'n_epochs': n_epochs,
        'dropout': dropout,
        'log_tensorboard': True,
        'torch_metrics': torch_metrics,
        'lr_scheduler_cls': lr_scheduler_cls,
        'lr_scheduler_kwargs': lr_scheduler_kwargs
    }
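For context, the returned dict is unpacked into the model constructors inside the nested loop, roughly like this (chunk lengths, the other argument values and the train_scaled/val_scaled series are just placeholders):
# Rough sketch of how create_params feeds the models in the nested loop
for i in seeds:
    seed_everything(i, workers=True)
    for model_arch, model_class in [('TiDE', TiDEModel), ('NHiTS', NHiTSModel)]:
        # fresh params (and callbacks) for every model/seed combination
        params = create_params(input_chunk_length=24,
                               output_chunk_length=12,
                               quantiles=None,
                               batch_size=64,
                               n_epochs=100,
                               dropout=0.1)
        model = model_class(**params, model_name=f'{model_arch}_seed_{i}')
        model.fit(series=train_scaled, val_series=val_scaled)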
Forecasts: I thought it was not recommended to do training and evaluation in one script when using multiple GPUs, since uneven inputs are not supported and the distributed sampler will influence the metrics:
https://github.com/Lightning-AI/pytorch-lightning/issues/8375
(one could use torch.distributed.destroy_process_group() to switch back to one GPU, though). Thus, I have a separate script for creating model forecasts, which is suboptimal because the data preprocessing has to be repeated (e.g. scaling). How would you do this?
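Roughly, that separate forecast script looks like this (model name, horizon and the series variables are placeholders):
from darts.dataprocessing.transformers import Scaler
from darts.models import TiDEModel

# The preprocessing has to be repeated: re-fit the scaler on the training series
scaler = Scaler()
train_scaled = scaler.fit_transform(train_series)  # train_series: placeholder TimeSeries
val_scaled = scaler.transform(val_series)          # val_series: placeholder TimeSeries

# Load the best checkpoint written during training (save_checkpoints=True)
model = TiDEModel.load_from_checkpoint(model_name='TiDE_seed_1', best=True)

# Forecast and map the predictions back to the original scale
pred_scaled = model.predict(n=12, series=val_scaled)
pred = scaler.inverse_transform(pred_scaled)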
Nice, I would not be able to tell why swapping the order of the loops fixed it, but as long as it works, it's great!
All good if you save the checkpoints and perform evaluation in a separate loop; I was just curious since it was not visible in the code snippet. It's indeed better to do it separately.
If the issue is solved, can you please close it?
Hi, when comparing multiple models and multiple seeds using a nested loop, all processes are terminated when the loop switches from one model class to the next. Does anyone have an idea why? Maybe I'm doing this wrong, or is this a PyTorch Lightning issue?
Error message: Child process with PID 652 terminated with code 1. Forcefully terminating all other processes to avoid zombies
Relevant code snippet: