plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Scalene error: received signal SIGSEGV when using PyTorch on ROCm #481

Open Bengt opened 1 year ago

Bengt commented 1 year ago

Describe the bug

When I run my PyTorch training code on ROCm with an AMD GPU under Scalene, it aborts with the following error:

$ scalene training.py
Scalene error: received signal SIGSEGV 

When I run the same code with only CPU profiling, the error disappears:

$ scalene --cpu-only training.py

To Reproduce

My training code is too large to reduce to a minimal working example with reasonable effort. However, note that simple PyTorch code works fine:

from torch import Tensor
from torch import rand

def pytorch_iterating_random_tensor():
    # Arrange
    dimension_0: int = 3
    dimension_1: int = 2

    # Act
    tensor: Tensor = rand(
        dimension_0,
        dimension_1,
    )

    # Assert
    assert isinstance(tensor, Tensor)
    for dimension_0_index in range(dimension_0):
        for dimension_1_index in range(dimension_1):
            assert 0 <= tensor[dimension_0_index][dimension_1_index] <= 1

if __name__ == '__main__':
    pytorch_iterating_random_tensor()
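The snippet above creates its tensor on the CPU, so it never touches ROCm. A closer test of the GPU path might look like the following minimal sketch. It assumes a ROCm build of PyTorch, where AMD GPUs are addressed through the `torch.cuda` API and the `"cuda"` device string; the function names `pick_device` and `gpu_random_tensor` are illustrative and not part of the original report.

```python
import torch


def pick_device() -> torch.device:
    # Assumption: on ROCm builds of PyTorch, AMD GPUs are exposed through
    # the torch.cuda API, so "cuda" is the device name on AMD as well.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def gpu_random_tensor(rows: int = 3, cols: int = 2) -> torch.Tensor:
    device = pick_device()
    # Allocating on the device forces the GPU runtime to initialize.
    tensor = torch.rand(rows, cols, device=device)
    # One trivial operation so that at least one kernel is launched.
    doubled = tensor * 2
    if device.type == "cuda":
        # Wait for the kernel to finish so a crash surfaces here.
        torch.cuda.synchronize()
    return doubled.cpu()


if __name__ == "__main__":
    # torch.version.hip is a version string on ROCm builds, None otherwise.
    print("HIP version:", torch.version.hip, flush=True)
    result = gpu_random_tensor()
    assert result.shape == (3, 2)
    assert bool(((result >= 0) & (result < 2)).all())
    print("GPU round trip finished", flush=True)
```

If this small script also crashes under `scalene` but runs under `scalene --cpu-only`, that would point at GPU initialization rather than at anything specific to the larger training code.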

Expected behavior

I would have expected Scalene to run on a more complex PyTorch application, just like on the trivial application.

Additional context

I first see some of my print output and then the segfault, so it seems likely that the initialization of ROCm/OpenML triggers the crash in Scalene.
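One way to narrow this down further is to wrap each startup step in flushed progress markers, so that the last marker printed before the crash identifies the step that failed. The following is a minimal sketch using only the standard library; the `stage` helper and the stage names are hypothetical and not taken from the training code.

```python
import sys
from contextlib import contextmanager


@contextmanager
def stage(name, stream=sys.stderr):
    """Print a flushed marker before and after a block of code.

    If the process dies inside the block, only the "entering" marker
    appears, which identifies the step that crashed.
    """
    print(f"[stage] entering {name}", file=stream, flush=True)
    yield
    print(f"[stage] finished {name}", file=stream, flush=True)


if __name__ == "__main__":
    # Placeholder steps; in real code these would wrap the framework
    # import, the first device allocation, the first forward pass, etc.
    with stage("setup"):
        values = [x * x for x in range(4)]
    with stage("compute"):
        total = sum(values)
    print(total)
```

Flushing matters here: output that is still buffered when the process receives SIGSEGV is lost, which can make the crash appear to happen earlier than it really does.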

vmkalbskopf commented 1 year ago

According to the README, I believe only NVIDIA GPUs are supported for GPU profiling.