PANXIONG-CN opened this issue 1 year ago
strategy: "auto"
with Multiple GPUsI made an update to the GPU configuration part of the code, setting the strategy
to "auto" as shown below:
```python
pl_trainer_kwargs = {"accelerator": "gpu", "strategy": "auto"}
```
Unfortunately, I still encountered the same RuntimeError:

```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```
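For context on what this message means: as far as I understand, PyTorch raises it when an operation's input and output tensors partially share memory. Here is a small, darts-independent sketch of my own (not taken from the library) that triggers the same error and shows the clone() fix the message suggests:

```python
import torch

x = torch.zeros(5)

# The input x[:-1] and the output x[1:] partially overlap in memory,
# which triggers the same "Please clone() the tensor" RuntimeError:
try:
    torch.add(x[:-1], 1, out=x[1:])
except RuntimeError as err:
    print(err)

# Cloning the input first removes the overlap, so this succeeds:
torch.add(x[:-1].clone(), 1, out=x[1:])
```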
The relevant code with the new configuration is:

```python
...
my_model = TransformerModel(
    ...
    pl_trainer_kwargs={"accelerator": "gpu", "strategy": "auto"},
)
my_model.fit(series=train_scaled, val_series=val_scaled, verbose=True)
...
```
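For anyone who wants to run this end to end, here is a self-contained version of the failing setup. The AirPassengers data, the Scaler step, and the chunk lengths are placeholders I picked to make it runnable, not my actual training code:

```python
import torch
from darts.models import TransformerModel
from darts.datasets import AirPassengersDataset
from darts.dataprocessing.transformers import Scaler

if __name__ == "__main__":
    torch.multiprocessing.freeze_support()

    # Placeholder data: scale the series and split into train/validation.
    series = Scaler().fit_transform(AirPassengersDataset().load())
    train_scaled, val_scaled = series.split_after(0.8)

    my_model = TransformerModel(
        input_chunk_length=12,
        output_chunk_length=6,
        # The same trainer settings also fail with "strategy": "ddp".
        pl_trainer_kwargs={"accelerator": "gpu", "strategy": "auto"},
    )
    my_model.fit(series=train_scaled, val_series=val_scaled, verbose=True)
```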
I expected the "auto" strategy to adapt to the available GPUs and run without the RuntimeError.
Again, using a single GPU works without any problems. It seems that whenever I try to leverage multiple GPUs, whether with the "ddp" or the "auto" strategy, the same error arises.
**TFTModel**

I tried a different approach and used TFTModel from the darts library. Surprisingly, with this model and configuration I was able to successfully use multiple GPUs without encountering the previous RuntimeError.
Here's the code that worked:
```python
import torch
from darts.models import TFTModel
from darts.datasets import AirPassengersDataset

if __name__ == "__main__":
    torch.multiprocessing.freeze_support()

    # Replicate the series so there is enough data to spread across GPUs.
    series = AirPassengersDataset().load()
    series = [series] * 100

    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=6,
        add_relative_index=True,
        pl_trainer_kwargs={"accelerator": "gpu", "devices": "auto"},
    )
    model.fit(series, epochs=10)
    preds = model.predict(n=6, series=series, num_samples=100)
    print("len(preds)", len(preds), "len(series)", len(series))
```
This leads me to believe that the issue might be specific to TransformerModel when running on multiple GPUs, given that TFTModel works fine in a similar setup. (One difference worth noting: the working TFTModel run only sets "devices": "auto" and leaves the strategy at its default, while the failing TransformerModel runs set "strategy" explicitly.) It would be great to get some insight into why TransformerModel struggles with multi-GPU configurations while TFTModel operates without issues.
**Describe the bug**

When I attempt to run the TransformerModel using multiple GPUs, I encounter the RuntimeError shown above.

**To Reproduce**

The code snippets above reproduce the issue.
**Expected behavior**

I expect the model to run using multiple GPUs without any issues.
**System (please complete the following information):**

**Additional context**

The model runs correctly when using a single GPU.