pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Kaggle Notebooks: TPU detected but won't use #7805

Open · MichaelSchroter opened this issue 3 months ago

MichaelSchroter commented 3 months ago

❓ Questions and Help

Hi all, I have this code:

import optuna
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm

# Assuming dataset, create_model, and the custom losses
# (DiceLoss, FocalLoss, CombinedLoss) are already defined
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

def objective(trial):
    device = xm.xla_device()
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
    dropout_prob = trial.suggest_float('dropout_prob', 0.2, 0.7)
    batch_size = trial.suggest_int('batch_size', 2, 32)
    optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD'])
    loss_fn_name = trial.suggest_categorical('loss_fn', ['DiceLoss', 'FocalLoss', 'CombinedLoss', 'BCEWithLogitsLoss'])

    backbone = "resnet101"
    model_name = "DeepLabV3Plus"
    model = create_model(model_name, encoder_name=backbone, in_channels=3, classes=1)
    model.to(device)

    if optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.0001)
    elif optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0001)

    if loss_fn_name == 'DiceLoss':
        loss_fn = DiceLoss()
    elif loss_fn_name == 'FocalLoss':
        loss_fn = FocalLoss()
    elif loss_fn_name == 'CombinedLoss':
        loss_fn = CombinedLoss()
    elif loss_fn_name == 'BCEWithLogitsLoss':
        pos_weight = torch.tensor([1.127], device=device)
        loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    for module in model.modules():
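        # Note: add_module only registers Dropout2d as a child of each Conv2d;
        # Conv2d's forward never calls its children, so this dropout is never
        # actually applied in the forward pass.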
        if isinstance(module, nn.Conv2d):
            module.add_module('dropout', nn.Dropout2d(dropout_prob))

    scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.1)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    num_epochs = 5
    best_loss = float('inf')

    for epoch in range(num_epochs):
        model.train()
        train_losses = []
        para_loader = pl.ParallelLoader(train_loader, [device])
        for inputs, targets in tqdm(para_loader.per_device_loader(device), desc=f"Epoch {epoch+1}/{num_epochs} - Training"):
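            # per_device_loader already moves each batch to the XLA device,
            # so the .to(device) call below is a harmless no-op.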
            inputs, targets = inputs.to(device), targets.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, targets.float())
            loss.backward()
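            # xm.optimizer_step reduces gradients across replicas and calls
            # optimizer.step(); the ParallelLoader above inserts the mark_step
            # that actually launches XLA execution for each batch.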
            xm.optimizer_step(optimizer)
            train_losses.append(loss.item())

        model.eval()
        val_losses = []
        para_loader = pl.ParallelLoader(val_loader, [device])
        with torch.no_grad():
            for inputs, targets in tqdm(para_loader.per_device_loader(device), desc=f"Epoch {epoch+1}/{num_epochs} - Validation"):
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = loss_fn(outputs, targets.float())
                val_losses.append(loss.item())

        val_loss = np.mean(val_losses)
        scheduler.step(val_loss)

        if val_loss < best_loss:
            best_loss = val_loss

    return best_loss

# Save the study to persistent storage
study_name = "my_study"
storage_name = "sqlite:///example.db"
study = optuna.create_study(direction='minimize', study_name=study_name, storage=storage_name, load_if_exists=True)
study.optimize(objective, n_trials=15)

# Print the best hyperparameters
print('Best trial:')
trial = study.best_trial
print(f'  Value: {trial.value}')
print('  Params: ')
for key, value in trial.params.items():
    print(f'    {key}: {value}')

However, even though the TPU is detected (Using device: xla:0), it does not show in the dashboard, and the TPU deactivates after a while due to not being used. Would anyone be able to help me with this matter, please? Thanks & Best Regards, AMJS

JackCaoG commented 3 months ago

@qihqi can you take a look at this one?

qihqi commented 3 months ago

Hi @MichaelSchroter,

A few clarifications on the repro steps:

  1. Kaggle notebook: if you happen to have the notebook link, would you be able to share it? If not, could you provide the exact steps you took to get the notebook allocated? I created a Kaggle notebook (just clicked the Google result; I had never used Kaggle before), pasted your snippet, executed it, and got the following error:
Cell In[1], line 5
      2 from torch.optim.lr_scheduler import ReduceLROnPlateau
      4 # Assuming dataset is already defined
----> 5 train_size = int(0.8 * len(dataset))
      6 val_size = len(dataset) - train_size
      7 train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

NameError: name 'dataset' is not defined

https://www.kaggle.com/code/scratchpad/notebook8f60daaa78/edit is the link to my notebook.

I guess there is some setup step you did and I did not. Please let me know.
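For reference, a stand-in dataset such as the following sketch would make the snippet self-contained enough to run; the shapes here are pure assumptions for a binary-segmentation setup, not something from your post:

import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-in for the undefined `dataset`: random 3-channel
# 128x128 images with single-channel binary masks (shapes are assumptions).
images = torch.randn(64, 3, 128, 128)
masks = torch.randint(0, 2, (64, 1, 128, 128)).float()
dataset = TensorDataset(images, masks)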

  2. You mentioned that the TPU "does not show in the dashboard". Which dashboard is that, exactly? If it is something Kaggle-specific, I would recommend also filing an issue with Kaggle.
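Independent of any dashboard, a quick check like this sketch (assuming torch_xla 2.x, where xm.xla_device_hw and the debug metrics module are available) shows whether the TPU backend is actually compiling and executing work:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
print(xm.xla_device_hw(device))   # should print 'TPU', not 'CPU'

# Force a small computation through the lazy XLA graph.
a = torch.randn(128, 128, device=device)
b = (a @ a).sum()
xm.mark_step()                    # launch pending XLA execution
print(b.item())

# Non-empty CompileTime/ExecuteTime entries mean the device did real work.
print(met.metrics_report())

If xla_device_hw reports CPU, torch_xla has fallen back to the CPU backend, and no dashboard would ever show TPU activity.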