MikeChenfu opened this issue 1 year ago
Hi @MikeChenfu

I think you misunderstand the meaning of the argument gpu_margin_mem_ratio. When using the auto policy in Gemini, we automatically detect your GPU memory usage and try to make full use of your CUDA memory. By default, Gemini keeps as many parameters as possible in CUDA during training, but some users want to place optimizer states in CUDA as well and update part of the parameters on the GPU. gpu_margin_mem_ratio is the ratio of the gap between your maximum CUDA memory usage and your full CUDA capacity that is used to store optimizer states.
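As a rough illustration, here is where gpu_margin_mem_ratio is usually passed when wrapping a model and optimizer with Gemini. This is only a sketch: the wrapper and import names (ColoInitContext, zero_model_wrapper, zero_optim_wrapper, HybridAdam, get_current_device) follow the ColossalAI examples of this period but may differ between versions, and build_opt_model is a hypothetical helper standing in for however you construct the OPT model.

```python
import torch
from colossalai.nn.optimizer import HybridAdam
from colossalai.utils import get_current_device
from colossalai.zero import ColoInitContext, zero_model_wrapper, zero_optim_wrapper

# Build the model under Gemini's init context so its parameters are managed by Gemini.
with ColoInitContext(device=torch.device('cpu')):
    model = build_opt_model()  # hypothetical helper that constructs the OPT model

# 'auto' lets Gemini decide at each step how many parameters stay in CUDA,
# based on the detected GPU memory usage.
gemini_config = dict(device=get_current_device(), placement_policy='auto', pin_memory=True)
model = zero_model_wrapper(model, zero_stage=3, gemini_config=gemini_config)

optimizer = HybridAdam(model.parameters(), lr=1e-4)

# gpu_margin_mem_ratio is the fraction of the spare CUDA memory (full capacity
# minus peak usage) that may hold optimizer states; 0.0 keeps them all on the CPU.
optim_config = dict(gpu_margin_mem_ratio=0.0)
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```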
As for the problem on multiple nodes, we may fix this bug soon.
Thanks @1SAA for the update. Previously I had to adjust gpu_margin_mem_ratio for better performance. It is good to hear that GPU memory usage is detected automatically. Does it mean I can just use the auto policy without passing gpu_margin_mem_ratio as an input parameter?
If you want to store more optimizer states in CUDA and update part of the parameters on the GPU, you can increase gpu_margin_mem_ratio.
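For example, under the same hypothetical setup as the sketch above, keeping more optimizer states on the GPU just means raising the ratio (0.8 is an illustrative value, not a recommendation):

```python
# Allow up to 80% of the spare CUDA memory (capacity minus peak usage)
# to hold optimizer states, so more parameter updates happen on the GPU.
optim_config = dict(gpu_margin_mem_ratio=0.8)
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```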
🐛 Describe the bug
Hello, I am training an OPT model on A100 GPUs. I found it used 76 GB of GPU memory when I use the auto mode and set gpu_margin_mem_ratio to 0. If I use the cpu mode, it only takes about 15 GB. In my understanding, both methods should use the same amount of GPU memory.

Also, I got different connection errors when I use the auto mode and set gpu_margin_mem_ratio to a non-zero value like 0.2 across two nodes. It works well on a single node, but it seems the gpu_margin_mem_ratio value does not control GPU memory usage.

Environment