microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Distributed Training randomly stuck in trainings loop #6524

Open raeudigerRaeffi opened 2 months ago

raeudigerRaeffi commented 2 months ago

Hi, I have a script that runs with the DataParallel trainer on a machine with 8 H100 GPUs (AWS P5 VM) with DeepSpeed. When we run the script, it randomly gets stuck forever at some iteration relatively late in the process (between the 2000th and 4000th iteration). We start the script with the following command:

accelerate launch src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml

GPU memory is only about 30% occupied and utilization is at 0%. The stack traces of the relevant processes look as follows:

pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {}
Process 39: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 39 (idle): "MainThread"
    backward (torch/autograd/__init__.py:266)
    backward (torch/_tensor.py:522)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2213)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/engine.py:1976)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2126)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:410)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:540)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:583)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:587)
Thread 930 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 4067 (active)
    all_gather_into_tensor (torch/distributed/distributed_c10d.py:2709)
    wrapper (torch/distributed/c10d_logger.py:72)
    all_gather_into_tensor (deepspeed/comm/torch.py:219)
    _fn (torch/_dynamo/eval_frame.py:489)
    all_gather_into_tensor (deepspeed/comm/comm.py:305)
    log_wrapper (deepspeed/comm/comm.py:117)
    allgather_fn (deepspeed/comm/comm.py:320)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _dist_allgather_fn (deepspeed/runtime/zero/partition_parameters.py:93)
    all_gather_coalesced (deepspeed/runtime/zero/partition_parameters.py:1217)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    __all_gather_params_ (deepspeed/runtime/zero/partitioned_param_coordinator.py:463)
    __all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:434)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:385)
    decorate_context (torch/utils/_contextlib.py:115)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _fn (torch/_dynamo/eval_frame.py:489)
    pre_sub_module_backward_function (deepspeed/runtime/zero/parameter_offload.py:474)
    decorate_context (torch/utils/_contextlib.py:115)
    _run_before_backward_function (deepspeed/runtime/zero/parameter_offload.py:339)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/zero/parameter_offload.py:358)
    apply (torch/autograd/function.py:289)
    backward (torch/autograd/__init__.py:266)
    backward (torch/utils/checkpoint.py:320)
    apply (torch/autograd/function.py:289)
Thread 4069 (idle)
Thread 4070 (idle)
Thread 4071 (idle)
Thread 4072 (idle)
Thread 4073 (idle)
Thread 4074 (idle)
Thread 4075 (idle)
Process 40: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 40 (idle): "MainThread"
    backward (torch/autograd/__init__.py:266)
    backward (torch/_tensor.py:522)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2213)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/engine.py:1976)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2126)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:410)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:540)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:583)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:587)
Thread 924 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 4040 (idle)
Thread 4044 (active)
    all_gather_into_tensor (torch/distributed/distributed_c10d.py:2709)
    wrapper (torch/distributed/c10d_logger.py:72)
    all_gather_into_tensor (deepspeed/comm/torch.py:219)
    _fn (torch/_dynamo/eval_frame.py:489)
    all_gather_into_tensor (deepspeed/comm/comm.py:305)

tohtana commented 2 months ago

Hi @raeudigerRaeffi, as you are running Mixtral, you might need to enable ZeRO3's leaf module. You can find the example here: https://github.com/microsoft/DeepSpeed/pull/5008#issuecomment-1910607845
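
For readers who do not want to follow the link, here is a minimal sketch of what enabling a ZeRO-3 leaf module can look like. It assumes the MoE block class name ends in `SparseMoeBlock` (as `MixtralSparseMoeBlock` does) and that the call happens before the model is handed to DeepSpeed, for example before `accelerator.prepare`. The helper names `find_moe_block_classes` and `mark_moe_blocks_as_leaves` are hypothetical; only `deepspeed.utils.set_z3_leaf_modules` is the library call.

```python
def find_moe_block_classes(model, suffix="SparseMoeBlock"):
    """Collect the module classes in `model` whose class name ends with `suffix`.

    `model` only needs a `modules()` method, as torch.nn.Module provides.
    """
    return {type(m) for m in model.modules() if type(m).__name__.endswith(suffix)}


def mark_moe_blocks_as_leaves(model):
    """Flag every MoE block class as a ZeRO-3 leaf module.

    With the flag set, ZeRO-3 gathers the parameters of the whole block at once
    instead of hooking each expert separately, so ranks that route tokens to
    different experts still issue the same sequence of collectives.
    """
    # Imported lazily so the helper above stays usable without DeepSpeed installed.
    from deepspeed.utils import set_z3_leaf_modules

    leaf_classes = sorted(find_moe_block_classes(model), key=lambda c: c.__name__)
    if not leaf_classes:
        raise ValueError("no MoE block classes found; check the class-name suffix")
    set_z3_leaf_modules(model, leaf_classes)
    return leaf_classes
```

Typical use would be `mark_moe_blocks_as_leaves(model)` right after loading the model and before preparing it for training.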

raeudigerRaeffi commented 2 months ago

Hi @tohtana, thanks for your reply. Sadly, this did not fix my issue; the training still randomly gets stuck. The GPU usage is also odd this time, with all cards except one at 100%. Interestingly, I just looked through the stacks and noticed that this time all active threads are stuck in the same function (partition_grads), which was previously never the case. To add to this, we are able to run our code successfully on A100 GPUs; the issue only occurs when we switch to P5 machines on AWS, which run on H100.

Thu Sep 12 23:12:25 2024
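+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 0:53:00.0 Off |                           0 |
| N/A   37C    P0             116W / 700W |  55217MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 0:64:00.0 Off |                           0 |
| N/A   39C    P0             121W / 700W |  56817MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 0:75:00.0 Off |                           0 |
| N/A   35C    P0             115W / 700W |  56717MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 0:86:00.0 Off |                           0 |
| N/A   41C    P0             119W / 700W |  56485MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 0:97:00.0 Off |                           0 |
| N/A   39C    P0             116W / 700W |  54407MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 0:A8:00.0 Off |                           0 |
| N/A   36C    P0             108W / 700W |  55677MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 0:B9:00.0 Off |                           0 |
| N/A   38C    P0             116W / 700W |  53081MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 0:CA:00.0 Off |                           0 |
| N/A   36C    P0             113W / 700W |  53609MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+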
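
The one idle card in the table above (GPU 6 at 0% utilization while the others sit at 100%) can also be picked out programmatically when a run hangs. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and reports numeric utilization. The helper names and the 5% threshold are illustrative and not part of the original report.

```python
import subprocess

# CSV query mode prints one line per GPU, e.g. "6, 0, 53081".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output into (index, util_percent, memory_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append((int(index), int(util), int(mem)))
    return rows


def idle_gpus(rows, util_threshold=5):
    """Return the indices of GPUs whose utilization is at or below the threshold."""
    return [index for index, util, _mem in rows if util <= util_threshold]


def query_gpus():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_rows(out)
```

During a hang, `idle_gpus(query_gpus())` returning a single index while every other card is busy would match the pattern shown above, where the remaining ranks spin while waiting on one peer.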
Python v3.10.12 (/usr/bin/python3.10)

Thread 4815 (idle): "MainThread"
    _invoke_run (torch/distributed/elastic/agent/server/api.py:835)
    run (torch/distributed/elastic/agent/server/api.py:680)
    wrapper (torch/distributed/elastic/metrics/api.py:124)
    launch_agent (torch/distributed/launcher/api.py:255)
    __call__ (torch/distributed/launcher/api.py:133)
    run (torch/distributed/run.py:892)
    deepspeed_launcher (accelerate/commands/launch.py:852)
    launch_command (accelerate/commands/launch.py:1159)
    main (accelerate/commands/accelerate_cli.py:48)
    <module> (accelerate:8)
Running py-spy dump on PID 4884
Process 4884: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4884 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5400 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8295 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8298 (idle)
Thread 8301 (idle)
Thread 8304 (idle)
Thread 8308 (idle)
Thread 8310 (idle)
Thread 8312 (idle)
Thread 8315 (idle)
Running py-spy dump on PID 4885
Process 4885: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4885 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5407 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8291 (idle)
Thread 8292 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8293 (idle)
Thread 8296 (idle)
Thread 8299 (idle)
Thread 8305 (idle)
Thread 8307 (idle)
Thread 8302 (idle)
Running py-spy dump on PID 4886
Process 4886: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4886 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5408 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8294 (idle)
Thread 8297 (idle)
Thread 8300 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8303 (idle)
Thread 8306 (idle)
Thread 8309 (idle)
Thread 8311 (idle)
Thread 8313 (idle)
Running py-spy dump on PID 4887
Process 4887: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4887 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5409 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8317 (idle)
Thread 8321 (idle)
Thread 8324 (idle)
Thread 8326 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8328 (idle)
Thread 8331 (idle)
Thread 8334 (idle)
Thread 8336 (idle)
Running py-spy dump on PID 4888
Process 4888: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4888 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5406 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8283 (idle)
Thread 8284 (idle)
Thread 8285 (idle)
Thread 8286 (idle)
Thread 8287 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8288 (idle)
Thread 8289 (idle)
Thread 8290 (idle)
Running py-spy dump on PID 4889
Process 4889: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4889 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5405 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8314 (idle)
Thread 8316 (idle)
Thread 8319 (idle)
Thread 8320 (idle)
Thread 8323 (idle)
Thread 8327 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8330 (idle)
Thread 8333 (idle)
Running py-spy dump on PID 4890
Process 4890: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4890 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5404 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8276 (idle)
Thread 8275 (idle)
Thread 8277 (idle)
Thread 8278 (idle)
Thread 8279 (idle)
Thread 8280 (idle)
Thread 8281 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)
Thread 8282 (idle)
Running py-spy dump on PID 4891
Process 4891: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)

Thread 4891 (idle): "MainThread"
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/_tensor.py:521)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage3.py:2247)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (deepspeed/runtime/engine.py:2020)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2188)
    training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:411)
    training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:539)
    main (src/model_back/healing/scripts/fine_tune_accelerate.py:582)
    <module> (src/model_back/healing/scripts/fine_tune_accelerate.py:586)
Thread 5410 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 8318 (idle)
Thread 8322 (idle)
Thread 8325 (idle)
Thread 8329 (idle)
Thread 8332 (idle)
Thread 8335 (idle)
Thread 8337 (idle)
Thread 8338 (active)
    partition_grads (deepspeed/runtime/zero/stage3.py:1500)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    __reduce_and_partition_ipg_grads (deepspeed/runtime/zero/stage3.py:1309)
    decorate_context (torch/utils/_contextlib.py:116)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage3.py:1257)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage3.py:1516)
    reduce_leaf_module_grads (deepspeed/runtime/zero/stage3.py:1196)
    wrapped_fn (deepspeed/utils/nvtx.py:18)
    _engine_run_backward (torch/autograd/graph.py:769)
    backward (torch/autograd/__init__.py:289)
    backward (torch/utils/checkpoint.py:314)
    apply (torch/autograd/function.py:306)