sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

[BUG] Issue with using optimise_hyperparameter with PyTorch DDP #1588

Open aman1b opened 1 month ago

aman1b commented 1 month ago

Hi community,

I have been stuck on this issue for some time now and would greatly appreciate any help! I am trying to run the optimize_hyperparameters function across 2 A100 GPUs using the PyTorch DDP strategy.

When I run this I get the following error: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 160 params, while rank 1 has inconsistent 137 params.

I have tried setting the seed across ranks, but no luck. Has anyone experienced this issue, or does anyone have an example of using this function to train a TFT with DDP?

I am using the latest package versions and training on an Azure VM. The application runs once I trigger the train_model function.
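My current suspicion is that each DDP rank runs its own Optuna sampling, so the two ranks can end up suggesting different hyperparameters (e.g. different hidden_size values) and therefore build models with different parameter counts. A tiny illustrative sketch of that effect (not my actual code; two independently seeded samplers stand in for the two ranks):

import optuna

# Two independently seeded samplers, standing in for two DDP ranks:
# each can suggest a different hidden_size for the "same" trial.
for rank, seed in enumerate([0, 1]):
    sampler = optuna.samplers.TPESampler(seed=seed)
    study = optuna.create_study(sampler=sampler)
    trial = study.ask()
    hidden_size = trial.suggest_int("hidden_size", 8, 128)
    print(f"rank {rank}: hidden_size={hidden_size}")

My full code is below.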

import time

import torch
from lightning.pytorch.strategies import DDPStrategy
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# `constants` and `logger` are project-specific modules (not shown)

def prepare_data(data_prep_folder):

    # Load in training and validation datasets
    training = torch.load(f"{data_prep_folder}/{constants.TRAIN_DATASET_FILE_NAME}")
    validation = torch.load(f"{data_prep_folder}/{constants.VALIDATION_DATASET_FILE_NAME}")

    logger.info(f"Training set loaded with length {len(training)}.")
    logger.info(f"Validation set loaded with length {len(validation)}.")

    # Create dataloaders
    train_dataloader = training.to_dataloader(
        train=True,
        batch_size=128,
        num_workers=47,
        pin_memory=True
    )

    val_dataloader = validation.to_dataloader(
        train=False,
        batch_size=128,
        num_workers=47,
        pin_memory=True
    )

    logger.info("Dataloaders created with batch size 128 and 47 workers.")
    return train_dataloader, val_dataloader

def hyperparameter_tuner(train_dataloader, val_dataloader, model_train_folder):

    # Start time
    start_time = time.time()
    logger.info("Starting hyperparameter tuning...")

    # Create study
    study = optimize_hyperparameters(
        train_dataloader,
        val_dataloader,
        model_path=model_train_folder,
        n_trials=2,
        max_epochs=30,
        gradient_clip_val_range=(0.01, 1.0),
        hidden_size_range=(8, 128),
        hidden_continuous_size_range=(8, 128),
        attention_head_size_range=(1, 4),
        learning_rate_range=(0.001, 0.1),
        dropout_range=(0.1, 0.3),
        trainer_kwargs=dict(
            accelerator='gpu',
            strategy=DDPStrategy(),
            devices='auto',
            limit_train_batches=10
        ),
        reduce_on_plateau_patience=4,
        use_learning_rate_finder=False
    )

    logger.info("Hyperparameter tuning finished.")

    # Get best parameters
    best_params = study.best_trial.params

    logger.info(f"Best trial parameters: {best_params}")

    training_time = time.time() - start_time
    hours, remainder = divmod(training_time, 3600)
    minutes, seconds = divmod(remainder, 60)

    logger.info(f"Tuning took {int(hours)} hours, {int(minutes)} minutes, and {int(seconds)} seconds.")

    return best_params
aman1b commented 1 month ago

Can anyone help here? How can I use DDP with the optimize_hyperparameters function?
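In the meantime, a workaround I am considering (an untested sketch, not something the library documents for this case): run the Optuna study on a single GPU so every trial builds exactly one model, then retrain the best configuration with DDP afterwards. Roughly:

# Untested sketch: tune on one GPU (no DDP), so only one rank samples
# hyperparameters; retrain the winning config with DDP afterwards.
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path=model_train_folder,
    n_trials=2,
    max_epochs=30,
    trainer_kwargs=dict(
        accelerator="gpu",
        devices=1,
        limit_train_batches=10,
    ),
    use_learning_rate_finder=False,
)
best_params = study.best_trial.params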

fkiraly commented 2 weeks ago

Potentially related to the Windows failures reported here: https://github.com/jdb78/pytorch-forecasting/issues/1623

Can you kindly paste the full output of pip list from your Python environment, and also let us know your operating system and Python version?
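If it is easier, a short snippet like the one below (just a convenience sketch) prints the same information from inside the environment:

# Collect OS, Python version, and installed package versions
import platform
import sys
from importlib import metadata

print("OS:", platform.platform())
print("Python:", sys.version)
for dist in sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
    print(dist.metadata["Name"], dist.version)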