learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

forking before initialization of the MPFuture handler - server runtime not initialized in WSL --new_hive #581

Open poedator opened 1 year ago

poedator commented 1 year ago

Apparently there is an issue with forking before initialization of the MPFuture handler. It happens when running a private hive under WSL. No such problem occurs when running on pure Linux, or on WSL with the real hive.

What likely happens:

- The ConnectionHandlers were created (as a process) while the background thread (the MPFuture handler) was still being initialized, so they forked the state of the main process in which the MPFuture handler was not fully initialized and was therefore broken.
- The ConnectionHandlers could not call self.ready.set_result, since this method is serviced by their (broken) MPFuture handler.
- The ModuleContainer got stuck in run -> handler.run_in_background -> handler.wait_until_ready().
- The ModuleContainer never reached Runtime.run(), which is only called after handler.run_in_background().
- The Runtime never started processing batches.
- The connection handlers that received a request from the client handed the task to the Runtime but never got a response, because the Runtime was never launched.
- The client did not receive a response from the server.
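For illustration only, here is a minimal, self-contained Python sketch of the general failure mode described above (none of the class or function names below come from hivemind): a child process is forked while the parent is still initializing a background handler thread, so the child inherits a half-initialized copy of that state and can never deliver its "ready" result.

```python
import multiprocessing as mp
import threading
import time


class FutureBackend:
    """Stand-in for the MPFuture handler: a background thread must finish
    initializing before results can be delivered through this object."""

    def __init__(self):
        self.ready = False
        # Initialization happens in a background thread, analogous to the
        # MPFuture handler thread described in the diagnosis above.
        threading.Thread(target=self._initialize, daemon=True).start()

    def _initialize(self):
        time.sleep(1.0)      # simulate slow initialization
        self.ready = True    # only the parent process ever sees this flip

    def set_result(self, value):
        if not self.ready:
            raise RuntimeError("backend was forked before it finished initializing")
        print("result delivered:", value)


def connection_handler(backend):
    # The forked child inherits backend.ready == False, and the parent's
    # _initialize thread does not exist in the child (fork copies only the
    # calling thread), so the backend stays broken here forever.
    try:
        backend.set_result("handler is ready")
    except RuntimeError as err:
        print("child:", err)


if __name__ == "__main__":
    backend = FutureBackend()
    # Fork a "ConnectionHandler" immediately, before the background thread
    # has finished -> the child gets a snapshot of half-initialized state.
    ctx = mp.get_context("fork")  # fork is the default start method on Linux / WSL
    handler_proc = ctx.Process(target=connection_handler, args=(backend,))
    handler_proc.start()
    handler_proc.join()
```

Since fork copies only the calling thread, the initialization thread simply does not exist in the child process, which matches the observation that the ConnectionHandlers' copy of the MPFuture handler never becomes functional.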

How to reproduce: run in WSL:

HIVEMIND_LOGLEVEL=DEBUG python -m petals.cli.run_server bigscience/bloom-560m --new_swarm --identity tests/test.id --host_maddrs /ip4/127.0.0.1/tcp/32337 --throughput 1 --torch_dtype float32 --compression NONE --attn_cache_tokens 2048 --max_chunk_size_bytes 1024

Problem symptoms: Server runtime seems inactive.

Environment:

If the environment collection script doesn't work, please report PyTorch and numpy versions manually. We also encourage you to include any additional information that you believe can help us solve the issue.
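The report below looks like the output of PyTorch's environment collection utility. Assuming that is the script referred to above, it can be regenerated with a short snippet like this (or with `python -m torch.utils.collect_env` from the shell):

```python
# Regenerate the environment report below (assumes torch is installed;
# torch.utils.collect_env is PyTorch's bundled environment-collection utility).
from torch.utils import collect_env

if __name__ == "__main__":
    collect_env.main()  # prints PyTorch/CUDA/OS/CPU details and pip/conda package versions
```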

PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 531.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 8
Model name: AMD Ryzen 7 2700X Eight-Core Processor
Stepping: 2
CPU MHz: 3700.062
BogoMIPS: 7400.12
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 16 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] torch==2.0.1
[pip3] triton==2.0.0
[conda] blas 2.16 mkl conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.2 256
[conda] numpy 1.25.1 pypi_0 pypi
[conda] pytorch 2.0.1 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchtriton 2.0.0 py310 pytorch

poedator commented 1 year ago

Also observed this issue on Linux (private swarm, bloom-560m).