
torch.multiprocessing.start_processes is blocking with large input arguments #133010

Open d4l3k opened 3 months ago

d4l3k commented 3 months ago

🐛 Describe the bug

With small input arguments (< 64 KiB), start_processes returns quickly since the processes are launched asynchronously.

When the arguments are larger, we end up blocking in https://github.com/python/cpython/blob/main/Lib/multiprocessing/popen_spawn_posix.py#L62 while writing the pickled arguments to the pipe. The default pipe buffer size is 64 KiB, so anything larger requires the child process to fully start and drain the pipe before the write can complete.

https://unix.stackexchange.com/questions/11946/how-big-is-the-pipe-buffer
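For illustration, here's a minimal sketch (plain os.pipe, not PyTorch code, assuming the typical 64 KiB Linux pipe capacity): a write that fits in the buffer returns immediately, while a write into a full pipe blocks until the reader drains it.

import os
import threading
import time

r, w = os.pipe()

def delayed_reader():
    # Simulate a child process that is slow to start reading its end of the pipe.
    time.sleep(2)
    while os.read(r, 1 << 16):
        pass

threading.Thread(target=delayed_reader, daemon=True).start()

start = time.perf_counter()
os.write(w, b"x" * (64 * 1024))  # fits in the (typically 64 KiB) buffer: returns immediately
print(f"first 64 KiB write returned after {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
os.write(w, b"x" * (64 * 1024))  # buffer already full: blocks until the reader starts draining
print(f"second 64 KiB write returned after {time.perf_counter() - start:.2f}s")

os.close(w)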

Repro:

import time
from torch.multiprocessing.spawn import start_processes
import os

time.sleep(1)

def trainer(rank, arg):
    # start_processes calls fn(rank, *args), so accept the large argument here.
    print(rank)

if __name__ == '__main__':
    world_size = 10

    start = time.perf_counter()

    args = ["1"*100000]

    ctx = start_processes(
            fn=trainer,
            args=args,
            nprocs=world_size,
            start_method="spawn",
            join=False,
        )

    print(f"Time taken: {time.perf_counter() - start}")

Versions

Collecting environment information...
PyTorch version: 2.5.0a0+git21d4c48
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.34

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-0_fbk21_hardened_12633_g4db063a1bcb5-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100

Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.8.1
/usr/lib64/libcudnn.so.9.1.1
/usr/lib64/libcudnn_adv.so.9.1.1
/usr/lib64/libcudnn_adv_infer.so.8.8.1
/usr/lib64/libcudnn_adv_train.so.8.8.1
/usr/lib64/libcudnn_cnn.so.9.1.1
/usr/lib64/libcudnn_cnn_infer.so.8.8.1
/usr/lib64/libcudnn_cnn_train.so.8.8.1
/usr/lib64/libcudnn_engines_precompiled.so.9.1.1
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.1.1
/usr/lib64/libcudnn_graph.so.9.1.1
/usr/lib64/libcudnn_heuristic.so.9.1.1
/usr/lib64/libcudnn_ops.so.9.1.1
/usr/lib64/libcudnn_ops_infer.so.8.8.1
/usr/lib64/libcudnn_ops_train.so.8.8.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          184
On-line CPU(s) list:             0-183
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 9654 96-Core Processor
CPU family:                      25
Model:                           17
Thread(s) per core:              1
Core(s) per socket:              184
Socket(s):                       1
Stepping:                        1
BogoMIPS:                        4792.79
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       11.5 MiB (184 instances)
L1i cache:                       11.5 MiB (184 instances)
L2 cache:                        92 MiB (184 instances)
L3 cache:                        2.9 GiB (184 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-183
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-coding==1.3.3
[pip3] flake8-comprehensions==3.15.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] optree==0.12.1
[pip3] pytorch-triton==3.0.0+a9bc1a3647
[pip3] torch==2.5.0a0+git21d4c48
[pip3] torchx==0.6.0
[conda] blas                      1.0                         mkl  
[conda] magma-cuda121             2.6.1                         1    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-include               2023.2.0            intel_49495    intel
[conda] mkl-service               2.4.0           py310h5eee18b_1  
[conda] mkl-static                2023.2.0            intel_49495    intel
[conda] mkl_fft                   1.3.8           py310h5eee18b_0  
[conda] mkl_random                1.2.4           py310hdb19cb5_0  
[conda] numpy                     1.26.0                   pypi_0    pypi
[conda] numpy-base                1.26.4          py310hb5e798b_0  
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] pytorch-triton            3.0.0+a9bc1a3647          pypi_0    pypi
[conda] torch                     2.5.0a0+git21d4c48           dev_0    <develop>
[conda] torchfix                  0.4.0                    pypi_0    pypi
[conda] torchx                    0.6.0                    pypi_0    pypi

cc @VitalyFedyunin

d4l3k commented 3 months ago

One solution would be to launch each process from a thread pool, which would let us avoid blocking on mp.Process.start. Using threads adds some overhead and may have other side effects, but should generally be fine.

subprocess supports specifying the pipe size (via the pipesize argument), but there doesn't seem to be any way to do this with mp.Process.
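A rough sketch of that idea (not the actual PyTorch change; the worker function and pool size are placeholders): start each spawned process from a thread pool so the per-process pipe writes overlap instead of blocking serially.

import multiprocessing as mp
import time
from concurrent.futures import ThreadPoolExecutor

def worker(rank, payload):
    print(rank, len(payload))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    payload = "1" * 100000
    procs = [ctx.Process(target=worker, args=(i, payload)) for i in range(10)]

    start = time.perf_counter()
    # Process.start() blocks while the >64 KiB pickled args are written to the
    # child's pipe; running the starts on threads overlaps that blocking.
    with ThreadPoolExecutor(max_workers=len(procs)) as pool:
        futures = [pool.submit(p.start) for p in procs]
        for f in futures:
            f.result()
    print(f"Launch time: {time.perf_counter() - start}")

    for p in procs:
        p.join()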

tushar1023 commented 2 months ago

import time
from torch.multiprocessing import spawn
import os

time.sleep(1)

def trainer(rank):
    print(f"Running trainer in process {rank}")

if __name__ == '__main__':
    world_size = 10

    start = time.perf_counter()

    # No need for large arguments, just pass simple arguments
    args = []

    # Spawn the processes
    spawn(
        fn=trainer,
        args=args,  # Empty args passed to each process
        nprocs=world_size,  # Number of processes
        join=True  # Wait for processes to complete
    )

    print(f"Time taken: {time.perf_counter() - start}")