[BUG] Lightning can't create new processes on GPU in Azure Databricks

valtterivalo commented 1 year ago

Describe the bug

When using the GPU as the accelerator on Azure Databricks, Lightning raises a runtime error: RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

To Reproduce

The bug can be reproduced with any example notebook from the Darts documentation by passing pl_trainer_kwargs = {'accelerator': 'gpu'} on a cluster that has GPUs available. I'm working from the N-BEATS example, although in my case the model is N-HiTS. N-BEATS runs into the same error, so the problem is not model-specific. The following code cell triggers the error:

import sys

import darts
from darts.models import NHiTSModel
from pytorch_lightning.callbacks import EarlyStopping
from torchmetrics import MeanAbsolutePercentageError

print(sys.version)
print(darts.__version__)

# Use mape as the early stopping monitor
torch_metrics = MeanAbsolutePercentageError()

# Early stopping callback for N-HiTS (or other deep learning models)
earlystopper = EarlyStopping(
    monitor='val_MeanAbsolutePercentageError',
    patience=5,
    min_delta=0.05,
    mode='min'
)

pl_trainer_kwargs = {
    'callbacks': [earlystopper],
    'accelerator': 'gpu',
    'devices': -1,  # -1 requests all available GPUs
}

nhits_model = None

nhits_model = NHiTSModel(
            input_chunk_length=12,
            output_chunk_length=1,
            num_stacks=32,
            num_blocks=4,
            num_layers=2,
            n_epochs=100,
            layer_widths=512,
            batch_size=16,
            pooling_kernel_sizes=None,
            n_freq_downsample=None,
            dropout=0.11894711771029268,
            torch_metrics=torch_metrics,
            activation='ReLU',
            MaxPool1d=True,
            force_reset=True,
            pl_trainer_kwargs=pl_trainer_kwargs,
        )

nhits_model.fit(train, val_series=val, verbose=True)

Expected behavior

Training should run normally on the GPU.

System (please complete the following information):

Python version: 3.9.5
darts version: 0.24.0

Additional context

To be fair, this looks like a Lightning issue rather than a Darts one.

solalatus commented 1 year ago

Strange, but it resembles a case I had in multi GPU. See this, might help...

valtterivalo commented 1 year ago

> Strange, but it resembles a case I had in multi GPU. See this, might help...

It was indeed a multi-GPU issue. I had missed the part of the documentation saying that multi-GPU training is not supported in notebooks. Training on a single GPU works fine.

Perhaps it's worth adding a hint to the error message that the user might be trying to use multiple GPUs in a notebook environment?
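For anyone landing here with the same error, the single-GPU workaround can be sketched roughly as follows. This is only an illustration, not something from Darts or Lightning: the helper names `running_in_notebook` and `gpu_trainer_kwargs` are made up, and checking for `ipykernel` is a heuristic assumption about how a notebook session can be detected. The only Lightning settings it relies on are the `accelerator` and `devices` Trainer arguments already used in the original cell.

```python
import sys


def running_in_notebook() -> bool:
    """Best-effort guess at whether we are inside an IPython-kernel notebook.

    Heuristic assumption: the kernel package is imported in notebook sessions.
    """
    return "ipykernel" in sys.modules


def gpu_trainer_kwargs(callbacks=None, interactive=None):
    """Build a pl_trainer_kwargs dict that stays on one GPU in notebooks.

    Hypothetical helper: `interactive` overrides the notebook detection,
    which is handy when the heuristic guesses wrong.
    """
    if interactive is None:
        interactive = running_in_notebook()
    kwargs = {
        "accelerator": "gpu",
        # A single device in notebooks, all available GPUs (-1) otherwise.
        "devices": 1 if interactive else -1,
    }
    if callbacks:
        kwargs["callbacks"] = list(callbacks)
    return kwargs


if __name__ == "__main__":
    # The resulting dict would be passed as pl_trainer_kwargs=... to the model.
    print(gpu_trainer_kwargs(interactive=True))
```

In the cell above, this would replace the hand-written dict, e.g. `pl_trainer_kwargs = gpu_trainer_kwargs(callbacks=[earlystopper])`. The kernel needs a restart first if the error has already been raised, as the message itself says.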

unit8co / darts

[BUG] Lightning can't create new processes on GPU in Azure Databricks #1743