triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/

Segmentation fault when DataLoader processes are launched after compiling Triton kernels #3864

Open leademeule opened 6 months ago

leademeule commented 6 months ago

I am working on a project that involves restructuring a network over different phases of training. A key aspect of this involves calls to custom Triton code, which is compiled and autotuned on the fly as the different phases of training are reached. Occasionally, and without any clear pattern, this would result in a segmentation fault through a DataLoader when Triton code is compiled, either immediately or when attempting to load the next batch.

I eventually narrowed it down to something very simple. It only seems to depend on compiling Triton code while DataLoaders are used with multiprocessing and without persistence. Below is code that reproduces the issue consistently. If enable_multiprocessing=True, enable_persistance=False, and enable_triton=True, the program crashes when attempting to query the DataLoader on the second epoch. If enable_multiprocessing=False, enable_persistance=True, or enable_triton=False, the program does not crash.

I am surprised this is an issue as it seems like a very common use case. I hope this can be resolved.

import torch
import torch.utils.data
import triton

@triton.jit()
def add_kernel(
    a_pointer,
    b_pointer,
    c_pointer,
    stride_a_batch,
    stride_a_vector,
    stride_b_batch,
    stride_b_vector,
    stride_c_batch,
    stride_c_vector,
    LENGTH_BATCH: triton.language.constexpr,
    LENGTH_VECTOR: triton.language.constexpr,
    BLOCK_SIZE_BATCH: triton.language.constexpr,
    BLOCK_SIZE_VECTOR: triton.language.constexpr,
):
    program_batch = triton.language.program_id(0)
    program_vector = triton.language.program_id(1)

    start_batch = program_batch * BLOCK_SIZE_BATCH
    start_vector = program_vector * BLOCK_SIZE_VECTOR

    offsets_batch = start_batch + triton.language.arange(0, BLOCK_SIZE_BATCH)
    offsets_vector = start_vector + triton.language.arange(0, BLOCK_SIZE_VECTOR)

    mask = (offsets_batch[:, None] < LENGTH_BATCH) & (
        offsets_vector[None, :] < LENGTH_VECTOR
    )

    a_loaded = triton.language.load(
        (
            a_pointer
            + offsets_batch[:, None] * stride_a_batch
            + offsets_vector[None, :] * stride_a_vector
        ),
        mask=mask,
    )
    b_loaded = triton.language.load(
        (
            b_pointer
            + offsets_batch[:, None] * stride_b_batch
            + offsets_vector[None, :] * stride_b_vector
        ),
        mask=mask,
    )

    c_loaded = a_loaded + b_loaded

    triton.language.store(
        (
            c_pointer
            + offsets_batch[:, None] * stride_c_batch
            + offsets_vector[None, :] * stride_c_vector
        ),
        c_loaded,
        mask=mask,
    )

def add(
    a: torch.Tensor,
    b: torch.Tensor,
):
    if not (len(a.shape) == 2 and len(b.shape) == 2):
        raise RuntimeError("add supports only tensors with two axes")
    if not (a.shape == b.shape):
        raise RuntimeError("add supports only tensors with the same shape")
    if not (a.is_cuda and b.is_cuda):
        raise RuntimeError("add supports only tensors on a CUDA device")
    if not (a.device == b.device):
        raise RuntimeError("add supports only tensors on the same CUDA device")
    if not (a.is_contiguous() and b.is_contiguous()):
        raise RuntimeError("add supports only tensors that are contiguous in memory")
    if not (a.dtype == torch.float32 and b.dtype == torch.float32):
        raise RuntimeError("add supports only tensors that are float32")

    c = torch.empty_like(a)

    LENGTH_BATCH = a.shape[0]
    LENGTH_VECTOR = a.shape[1]

    BLOCK_SIZE_BATCH = 8
    BLOCK_SIZE_VECTOR = 8

    grid = (
        triton.cdiv(LENGTH_BATCH, BLOCK_SIZE_BATCH),
        triton.cdiv(LENGTH_VECTOR, BLOCK_SIZE_VECTOR),
    )

    add_kernel[grid](
        a,
        b,
        c,
        a.stride(0),
        a.stride(1),
        b.stride(0),
        b.stride(1),
        c.stride(0),
        c.stride(1),
        LENGTH_BATCH,
        LENGTH_VECTOR,
        BLOCK_SIZE_BATCH,
        BLOCK_SIZE_VECTOR,
    )

    return c

class DatasetDummy(torch.utils.data.Dataset):
    def __init__(
        self,
    ):
        super().__init__()
        self.dataset_dimensionality = 8192
        self.dataset_size = 128
        self.dataset_generator = torch.Generator()

    def __getitem__(self, index):
        self.dataset_generator.manual_seed(index)

        sample_source = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )
        sample_target = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )

        return sample_source, sample_target

    def __len__(self):
        return self.dataset_size

if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = False
    enable_triton = True

    dataset = DatasetDummy()
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")
        for batch_index, batch in enumerate(loader):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

This gives the following backtrace:

Traceback (most recent call last):
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 74266) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "reproduce.py", line 152, in <module>
    for batch_index, batch in enumerate(loader):
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "$REDACTED_VIRTUAL_ENVIRONMENT_PATH/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 74266) exited unexpectedly

Below are the specifications of the two systems I used to reproduce the issue:

os:           Ubuntu 20.04.6 LTS x86_64
cpu:          AMD Ryzen 9 5950X
gpu:          NVIDIA RTX A5000
cuda driver:  550.54.15
cuda version: 12.4
python:       3.8.10
torch:        2.3.0
triton:       2.3.0

os:           Ubuntu 22.04.4 LTS
cpu:          AMD EPYC 7742
gpu:          NVIDIA A100-SXM4-40GB
cuda driver:  535.161.08
cuda version: 12.2
python:       3.9.15
torch:        2.3.0
triton:       2.3.0
leademeule commented 6 months ago

It seems making the DataLoaders persistent is not sufficient to prevent segmentation faults in more complex setups. I will keep trying to isolate the issue.

leademeule commented 6 months ago

The slightly modified code below mimics a typical setup where two DataLoaders are used to cover a training dataset and a testing dataset. Even with enable_persistance=True, the code crashes when the testing DataLoader is reached.

class DatasetDummy(torch.utils.data.Dataset):
    def __init__(
        self,
        dataset_partition,
    ):
        super().__init__()
        self.dataset_dimensionality = 8192
        self.dataset_size = 128
        self.dataset_generator = torch.Generator()
        self.dataset_partition = dataset_partition

    def __getitem__(self, index):
        if self.dataset_partition:
            self.dataset_generator.manual_seed(index)
        else:
            self.dataset_generator.manual_seed(-1 - index)

        sample_source = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )
        sample_target = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )

        return sample_source, sample_target

    def __len__(self):
        return self.dataset_size

if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

It seems the problem really kicks in when DataLoader workers are created.

I would greatly appreciate help on this, as the overhead of disabling multiprocessing makes Triton unusable for my application, yet Triton greatly improves the performance of important computational bottlenecks.

leademeule commented 6 months ago

I have yet another version of the code that confirms the issue occurs when DataLoader workers are started after Triton compilation. Setting enable_preload=True and enable_persistance=True launches the workers early and seems to prevent all crashes:

if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True
    enable_preload = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    if enable_preload:
        for batch in loader_train:
            break
        for batch in loader_test:
            break

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()
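
The preload trick above can also be packaged as a small helper (warm_up_loaders is a hypothetical name, not from the report): pull one batch from each persistent DataLoader so the worker processes are forked before any Triton kernel is compiled in the parent process.

def warm_up_loaders(*loaders):
    # Pull a single batch from each DataLoader so its persistent worker
    # processes are started *before* any Triton compilation happens in the
    # parent process.
    for loader in loaders:
        for _ in loader:
            break

# Usage, before the first call that triggers Triton compilation:
# warm_up_loaders(loader_train, loader_test)
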
TidalPaladin commented 6 months ago

I think I'm running into the same problem. I'm getting ERROR: Unexpected segmentation fault encountered in worker on multiple workers while training a model that uses Triton kernels. Persistent workers don't fix the issue. Setting num_workers=0 prevents the segfault, but then training is CPU-bottlenecked. Crashes always happen at the end of an epoch (presumably when workers are relaunched), though not always on the first epoch.

I pulled a core dump from one of the segfaulted workers.

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x000071f4d42ab393 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2  0x000071f4d425a6c8 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x000071f4d1562e13 in handler_SIGSEGV(int, siginfo_t*, void*) ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#4  <signal handler called>
#5  __pthread_clockjoin_ex (threadid=125287851361984, thread_return=0x0, clockid=0, abstime=0x0, block=true) at pthread_join_common.c:43
#6  0x000071f392131228 in llvm::llvm_thread_join_impl(unsigned long) ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#7  0x000071f3942b0408 in llvm::ThreadPool::~ThreadPool() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#8  0x000071f392e590b9 in mlir::MLIRContextImpl::~MLIRContextImpl() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#9  0x000071f392e52d27 in mlir::MLIRContext::~MLIRContext() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#10 0x000071f390d228da in std::default_delete<mlir::MLIRContext>::operator() (this=<optimized out>, __ptr=0x71ecb11e9db0)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#11 std::default_delete<mlir::MLIRContext>::operator() (__ptr=0x71ecb11e9db0, this=<optimized out>)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#12 std::unique_ptr<mlir::MLIRContext, std::default_delete<mlir::MLIRContext> >::~unique_ptr (this=<optimized out>, 
    __in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:361
#13 pybind11::class_<mlir::MLIRContext>::dealloc (v_h=...) at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/pybind11.h:1880
#14 0x000071f390cd6840 in pybind11::detail::clear_instance (self=0x71f4134bec30)
    at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:424
#15 0x000071f390cd7431 in pybind11::detail::pybind11_object_dealloc (self=0x71f4134bec30)
    at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:457
#16 0x000071f4d48a9ea3 in _Py_Dealloc (op=0x71f2d9e006c0) at Objects/object.c:2390
#17 Py_DECREF (op=0x71f2d9e006c0) at ./Include/object.h:538
#18 _PyObject_ClearInstanceAttributes (self=0x71f418cfeb10) at Objects/dictobject.c:5566
#19 subtype_clear (self=0x71f418cfeb10) at Objects/typeobject.c:1279
#20 0x000071f4d481e8b8 in delete_garbage (tstate=0x71f4d5b48c58 <_PyRuntime+166328>, gcstate=0x71f4d5b2eb60 <_PyRuntime+59584>, 
    collectable=0x7fff5ac9ea10, old=0x71f4d5b2eba8 <_PyRuntime+59656>) at Modules/gcmodule.c:1013

And version info:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Artix Linux (x86_64)
GCC version: (GCC) 14.1.1 20240507
Clang version: 17.0.6
CMake version: version 3.29.3
Libc version: glibc-2.39

Python version: 3.11.8 (main, Feb 25 2024, 04:18:18) [Clang 17.0.6 ] (64-bit runtime)
Python platform: Linux-6.8.9-artix1-2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen Threadripper 3960X 24-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             0
Frequency boost:                      enabled
CPU(s) scaling MHz:                   73%
CPU max MHz:                          4568.1641
CPU min MHz:                          2200.0000
BogoMIPS:                             7603.36
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             12 MiB (24 instances)
L3 cache:                             128 MiB (8 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==2.2.4
[pip3] torch==2.3.0
[pip3] torch-dicom==0.1.dev68+g40a15aa
[pip3] torchmetrics==1.4.0
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[pip3] triton-helpers==0.1.dev16+g179af43
[conda] Could not collect
flishwang commented 6 months ago

I also ran into this bug and created an issue on the PyTorch side.

flishwang commented 6 months ago

There are some workaround methods that may work, but I am not sure when and where they work. @TidalPaladin @leademeule
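
One general mitigation that is often suggested for fork-related crashes like this one (a hedged sketch, not a workaround taken from this thread) is to have the DataLoader spawn its workers instead of forking them, so the workers never inherit the threads and MLIR/LLVM objects created while compiling Triton kernels in the main process:

import torch
import torch.utils.data

if __name__ == "__main__":
    # Any picklable Dataset works with the 'spawn' context; TensorDataset is
    # used here only to keep the sketch self-contained.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(128, 8192), torch.randn(128, 8192)
    )
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=16,
        num_workers=1,
        persistent_workers=True,
        # 'spawn' starts each worker in a fresh interpreter instead of
        # fork()ing the parent process, at the cost of slower worker startup.
        multiprocessing_context="spawn",
    )
    for batch_source, batch_target in loader:
        pass  # Triton kernels can be compiled and launched here as usual.
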

leademeule commented 5 months ago

@flishwang thank you for sharing. On my side I have continued using the persistent data loader trick to avoid the crash. It thankfully has worked consistently over the last few weeks. A proper fix would be much appreciated however.

TidalPaladin commented 5 months ago

@flishwang I have adopted the manually-break-early strategy. Since I'm using PyTorch Lightning I don't have easy access to the DataLoader directly, but setting limit_train_batches=0.95 and limit_val_batches=0.95 in pl.Trainer does the trick.
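
For reference, a minimal sketch of that Lightning configuration (the fractions are the ones quoted above; everything else is illustrative):

import pytorch_lightning as pl

# Skipping the last ~5% of batches keeps the DataLoader iterators from being
# fully exhausted, which approximates the manual break-early workaround when
# the loaders are driven by Lightning rather than iterated by hand.
trainer = pl.Trainer(
    limit_train_batches=0.95,
    limit_val_batches=0.95,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule defined elsewhere
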

TidalPaladin commented 4 months ago

It seems that this issue is not fully mitigated by the early break workaround. When running for a large number of epochs (200+) the error reappears. This is still much longer than I would be able to run without early breaking. For now I have disabled the Triton components of my model and have had no issues since.

TidalPaladin commented 4 months ago

I think this has been resolved with the 3.0 update. I'm no longer seeing segmentation faults.

silingtong123 commented 3 months ago

> I think this has been resolved with the 3.0 update. I'm no longer seeing segmentation faults.

Which commit?

23Uday commented 3 months ago

This hasn't been resolved for me even after the Triton 3.0 update, although with Triton 2.3.1 it used to happen every time.

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File ".../script_train.py", line 73, in <module>
    main_wrapper()
  File ".../script_train.py", line 34, in main_wrapper
    main(data_dir, exp, method, optim,
  File ".../script_train.py", line 69, in main
    train(config, dataloader, model, model_path, device)
  File ".../train.py", line 113, in train
    loss_train.append(losstrain.item())
  File ".../pytorch-2.3.1_cu121_py310_triton/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 787) is killed by signal: Segmentation fault.

Oktai15 commented 4 weeks ago

@flishwang @leademeule the problem still happens. Does anyone know a good workaround for PyTorch Lightning?