leademeule opened this issue 6 months ago
It seems making the data loaders persistent is not sufficient to prevent segmentation faults with more complex setups. I will keep trying to isolate the issue.
The slightly modified code below mimics a typical setup where two DataLoaders are used to cover a training dataset and a testing dataset. Even with enable_persistance=True, the code crashes when the testing DataLoader is reached.
import torch


class DatasetDummy(torch.utils.data.Dataset):
    def __init__(
        self,
        dataset_partition,
    ):
        super().__init__()
        self.dataset_dimensionality = 8192
        self.dataset_size = 128
        self.dataset_generator = torch.Generator()
        self.dataset_partition = dataset_partition

    def __getitem__(self, index):
        if self.dataset_partition:
            self.dataset_generator.manual_seed(index)
        else:
            self.dataset_generator.manual_seed(-1 - index)
        sample_source = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )
        sample_target = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )
        return sample_source, sample_target

    def __len__(self):
        return self.dataset_size


if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch
            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch
            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()
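Note that add is not defined in the snippet above; it is assumed to be a plain element-wise Triton kernel along the lines of Triton's vector-add tutorial. A minimal sketch that makes the reproduction self-contained could look like this (the kernel and launcher names are placeholders for whatever the original code uses):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    # Hypothetical launcher; the first call triggers Triton compilation,
    # which is when the crashes described in this thread appear.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out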
It seems the problem really kicks in when DataLoader workers are created.
I would greatly appreciate help on this, as the overhead of disabling multiprocessing makes Triton unusable for my application, yet Triton greatly improves the performance of important computational bottlenecks.
I have yet another version of the code that confirms the issue occurs when DataLoader workers are started after Triton compilation. Setting enable_preload=True and enable_persistance=True launches the workers early and seems to prevent all crashes:
if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True
    enable_preload = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    if enable_preload:
        # Pull a single batch from each loader so the persistent workers are
        # spawned before any Triton compilation happens.
        for batch in loader_train:
            break
        for batch in loader_test:
            break

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch
            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch
            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()
I think I'm running into the same problem. I'm getting ERROR: Unexpected segmentation fault encountered in worker on multiple workers while training a model that uses Triton kernels. Persistent workers don't fix the issue. Setting num_workers=0 prevents the segfault, but training becomes CPU bottlenecked. Crashes always happen at the end of an epoch (presumably when workers are relaunched), though not necessarily on the first epoch.
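For concreteness, the two mitigations mentioned above amount to DataLoader configurations roughly like the following sketch (the dataset and batch size are placeholders borrowed from the reproduction earlier in the thread):

import torch

dataset = DatasetDummy(True)  # dummy dataset from the reproduction above

# Mitigation 1: keep workers alive across epochs (did not help in my case).
loader_persistent = torch.utils.data.DataLoader(
    dataset, batch_size=16, num_workers=4, persistent_workers=True
)

# Mitigation 2: no worker processes at all; avoids the segfault but makes
# data loading the bottleneck.
loader_single_process = torch.utils.data.DataLoader(
    dataset, batch_size=16, num_workers=0
)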
I pulled a core dump from one of the segfaulted workers.
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x000071f4d42ab393 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2 0x000071f4d425a6c8 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3 0x000071f4d1562e13 in handler_SIGSEGV(int, siginfo_t*, void*) ()
from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#4 <signal handler called>
#5 __pthread_clockjoin_ex (threadid=125287851361984, thread_return=0x0, clockid=0, abstime=0x0, block=true) at pthread_join_common.c:43
#6 0x000071f392131228 in llvm::llvm_thread_join_impl(unsigned long) ()
from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#7 0x000071f3942b0408 in llvm::ThreadPool::~ThreadPool() ()
from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#8 0x000071f392e590b9 in mlir::MLIRContextImpl::~MLIRContextImpl() ()
from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#9 0x000071f392e52d27 in mlir::MLIRContext::~MLIRContext() ()
from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#10 0x000071f390d228da in std::default_delete<mlir::MLIRContext>::operator() (this=<optimized out>, __ptr=0x71ecb11e9db0)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#11 std::default_delete<mlir::MLIRContext>::operator() (__ptr=0x71ecb11e9db0, this=<optimized out>)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#12 std::unique_ptr<mlir::MLIRContext, std::default_delete<mlir::MLIRContext> >::~unique_ptr (this=<optimized out>,
__in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:361
#13 pybind11::class_<mlir::MLIRContext>::dealloc (v_h=...) at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/pybind11.h:1880
#14 0x000071f390cd6840 in pybind11::detail::clear_instance (self=0x71f4134bec30)
at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:424
#15 0x000071f390cd7431 in pybind11::detail::pybind11_object_dealloc (self=0x71f4134bec30)
at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:457
#16 0x000071f4d48a9ea3 in _Py_Dealloc (op=0x71f2d9e006c0) at Objects/object.c:2390
#17 Py_DECREF (op=0x71f2d9e006c0) at ./Include/object.h:538
#18 _PyObject_ClearInstanceAttributes (self=0x71f418cfeb10) at Objects/dictobject.c:5566
#19 subtype_clear (self=0x71f418cfeb10) at Objects/typeobject.c:1279
#20 0x000071f4d481e8b8 in delete_garbage (tstate=0x71f4d5b48c58 <_PyRuntime+166328>, gcstate=0x71f4d5b2eb60 <_PyRuntime+59584>,
collectable=0x7fff5ac9ea10, old=0x71f4d5b2eba8 <_PyRuntime+59656>) at Modules/gcmodule.c:1013
And version info:
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Artix Linux (x86_64)
GCC version: (GCC) 14.1.1 20240507
Clang version: 17.0.6
CMake version: version 3.29.3
Libc version: glibc-2.39
Python version: 3.11.8 (main, Feb 25 2024, 04:18:18) [Clang 17.0.6 ] (64-bit runtime)
Python platform: Linux-6.8.9-artix1-2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 73%
CPU max MHz: 4568.1641
CPU min MHz: 2200.0000
BogoMIPS: 7603.36
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==2.2.4
[pip3] torch==2.3.0
[pip3] torch-dicom==0.1.dev68+g40a15aa
[pip3] torchmetrics==1.4.0
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[pip3] triton-helpers==0.1.dev16+g179af43
[conda] Could not collect
I also met this bug and created an issue on the PyTorch side.
Some workaround methods that may work, though I am not sure when and where they apply. @TidalPaladin @leademeule
@flishwang thank you for sharing. On my side I have continued using the persistent data loader trick to avoid the crash. It thankfully has worked consistently over the last few weeks. A proper fix would be much appreciated however.
@flishwang I have adopted the manual break-early strategy. Since I'm using PyTorch Lightning I don't have easy access to the data loaders directly, but setting limit_train_batches=0.95, limit_val_batches=0.95 in pl.Trainer does the trick.
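For other Lightning users, a minimal sketch of the settings mentioned above (all other Trainer arguments are placeholders):

import pytorch_lightning as pl

# Limiting each epoch to ~95% of the batches means the DataLoader iterator is
# never run to exhaustion, mimicking the "break early" workaround.
trainer = pl.Trainer(
    max_epochs=100,  # placeholder
    limit_train_batches=0.95,
    limit_val_batches=0.95,
)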
It seems that this issue is not fully mitigated by the early break workaround. When running for a large number of epochs (200+) the error reappears. This is still much longer than I would be able to run without early breaking. For now I have disabled the Triton components of my model and have had no issues since.
I think this has been resolved with the 3.0 update. I'm no longer seeing segmentation faults.
Which commit?
This hasn't been resolved for me even after the Triton 3.0 update, although with Triton 2.3.1 it used to happen every time.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File ".../script_train.py", line 73, in <module>
    main_wrapper()
  File ".../script_train.py", line 34, in main_wrapper
    main(data_dir, exp, method, optim,
  File ".../script_train.py", line 69, in main
    train(config, dataloader, model, model_path, device)
  File ".../train.py", line 113, in train
    loss_train.append(losstrain.item())
  File ".../pytorch-2.3.1_cu121_py310_triton/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 787) is killed by signal: Segmentation fault.
@flishwang @leademeule the problem still happens. Does anyone know a good workaround for PyTorch Lightning?
I am working on a project that involves restructuring a network over different phases of training. A key aspect of this involves calls to custom Triton code, which is compiled and autotuned on the fly as the different phases of training are reached. Occasionally, and without a clear pattern, this would result in a segmentation fault through a DataLoader when Triton code is compiled, either immediately or when attempting to load the next batch.
I eventually narrowed it down to something very simple. It only seems to depend on compiling Triton code while DataLoaders are used with multiprocessing and without persistent workers. Below is code that reproduces the issue consistently. If enable_multiprocessing=True and enable_persistance=False and enable_triton=True, the program crashes when attempting to query the DataLoaders on the second epoch. If enable_multiprocessing=False or enable_persistance=True or enable_triton=False, the program does not crash.
I am surprised this is an issue, as it seems like a very common use case. I hope this can be resolved.
This gives the following backtrace:
Below are the specifications of two systems I used to reproduce the issue: