This error likely originates from OneFlow. Similar problems have been reported in other issues (#393, #1080), but none of them arrived at a solution.
@strint Could you please take a look at this, or do you have any suggestions for a solution?
Your current environment information
libibverbs not available, ibv_fork_init skipped
Collecting environment information...
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OneFlow version: path: ['/opt/conda/lib/python3.10/site-packages/oneflow'], version: 0.9.1.dev20241019+cu118, git_commit: d23c061, cmake_build_type: Release, rdma: True, mlir: True, enterprise: False
Nexfort version: none
OneDiff version: 1.2.1.dev15+g241fe57d
OneDiffX version: none
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.28
Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40
Nvidia driver version: 525.125.06
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.7
/usr/lib64/libcudnn_adv_infer.so.8.9.7
/usr/lib64/libcudnn_adv_train.so.8.9.7
/usr/lib64/libcudnn_cnn_infer.so.8.9.7
/usr/lib64/libcudnn_cnn_train.so.8.9.7
/usr/lib64/libcudnn_ops_infer.so.8.9.7
/usr/lib64/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9K84 96-Core Processor
Stepping: 0
CPU MHz: 2600.034
BogoMIPS: 5200.06
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-191
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Versions of relevant libraries:
[pip3] diffusers==0.30.3
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.19.2
[pip3] onnxruntime-gpu==1.18.0
[pip3] open-clip-torch==2.20.0
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchsde==0.2.6
[pip3] torchvision==0.16.1
[pip3] transformers==4.44.2
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.10 py310h5eee18b_0
[conda] mkl_random 1.2.7 py310h1128e8f_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] open-clip-torch 2.20.0 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.1.1 pypi_0 pypi
[conda] torchaudio 2.1.1 py310_cu121 pytorch
[conda] torchsde 0.2.6 pypi_0 pypi
[conda] torchvision 0.16.1 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
🐛 Describe the bug
I encountered a Segmentation fault while using OneDiff in ComfyUI; there is no additional error output. I am looking for some assistance in tracking it down.
I encountered two types of Segmentation fault errors:
After testing, the error does not appear to be tied to any single workflow, nor to the workflows executed before it. I currently have no leads and have not been able to reproduce it; the crashes seem to occur at random.
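To get at least a Python-level stack trace the next time the crash happens, one option is to enable the standard library's faulthandler before ComfyUI starts, so a SIGSEGV prints the active Python frames to stderr. This is only a minimal sketch, not something verified against this particular crash; main.py stands in for whatever script launches the ComfyUI instance.

```python
# Minimal sketch: print the Python stack of every thread when the
# process receives a fatal signal such as SIGSEGV.
# Place this at the very top of the ComfyUI entry script (e.g. main.py).
import sys
import faulthandler

# Dump tracebacks for all threads to stderr on a fatal signal.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same effect is available without editing any files by launching with `python -X faulthandler main.py`. If core dumps can be enabled (`ulimit -c unlimited`), loading the dump in gdb and running `bt` should show the native frame inside OneFlow where the fault occurs, which would make the report more actionable.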