tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

65B on multiple GPUs: CUDA out of memory with 4 × RTX A5000 (24 GB), 96 GB in total #18

Open scampion opened 1 year ago

scampion commented 1 year ago

For the moment, I can't run the 65B model on 4 GPUs with 96 GB of VRAM in total.

Investigating, my first idea is that bitsandbytes was compiled without GPU support ("8-bit optimizers and GPU quantization are unavailable").
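As a quick sanity check, that warning can be confirmed programmatically. A minimal sketch, assuming a bitsandbytes build from around this era (the COMPILED_WITH_CUDA flag and module layout vary by version):

```python
# Diagnostic sketch: confirm torch sees the GPUs, then check whether
# bitsandbytes loaded a CUDA-enabled native library or the CPU-only fallback.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

# COMPILED_WITH_CUDA is an internal flag in bitsandbytes ~0.37; it is False
# when the CPU-only library (libbitsandbytes_cpu.so) was loaded.
from bitsandbytes.cextension import COMPILED_WITH_CUDA
print("bitsandbytes compiled with CUDA:", COMPILED_WITH_CUDA)
```

Here is the failing run: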

```
[1] % torchrun --nproc_per_node 4 example.py --ckpt_dir ../../LLaMA/30B --tokenizer_path ../../LLaMA/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Allocating transformer on host
Allocating transformer on host
Allocating transformer on host
Allocating transformer on host
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 132, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.08 GiB already allocated; 6.94 MiB free; 5.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 887816) of binary: /home/scampion/Code/llama/venv/bin/python
Traceback (most recent call last):
  File "/home/scampion/Code/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 887817)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 887818)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 887819)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 887816)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
scampion commented 1 year ago

After recompiling bitsandbytes from source against CUDA 11.7 (a version supported by my torch build), the issue is still there.

scampion commented 1 year ago

My mistake: example.py doesn't support multiple GPUs. WIP
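Each rank's traceback above fails on GPU 0, which suggests all four processes allocate on the same device. As a first step toward multi-GPU support, a minimal sketch (untested against this fork) would pin each worker to its own GPU before the model is built, using the LOCAL_RANK variable that torchrun exports:

```python
# Sketch: pin each torchrun worker to its own GPU before any CUDA
# allocation happens, so the four ranks stop piling onto GPU 0.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
```

On its own this only spreads the processes across devices; without model parallelism each rank would still try to hold the full 65B weights.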

entn-at commented 1 year ago

It's complicated. This fork got rid of many of the pieces required for multi-GPU usage. One way to restore that would be to create adapted versions of fairscale's model-parallel layers (https://github.com/facebookresearch/fairscale/blob/main/fairscale/nn/model_parallel/layers.py) that use bitsandbytes.
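To make that concrete, here is a rough, untested sketch of the idea. Int8ColumnParallelLinear is a hypothetical name; the layer mimics fairscale's ColumnParallelLinear (output dimension sharded across ranks) while delegating the local matmul to bitsandbytes:

```python
# Sketch of a column-parallel linear layer whose local weight shard is an
# 8-bit bitsandbytes layer. Inference-oriented; assumes torch.distributed
# is already initialized with one rank per GPU.
import torch
import torch.distributed as dist
import bitsandbytes as bnb


class Int8ColumnParallelLinear(torch.nn.Module):
    """Shards the output dimension across ranks; each rank holds an
    int8-quantized slice of the weight matrix."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "out_features must divide evenly"
        self.out_per_rank = out_features // world_size
        # The local shard is an ordinary bitsandbytes 8-bit linear layer.
        self.local = bnb.nn.Linear8bitLt(
            in_features, self.out_per_rank, bias=bias, has_fp16_weights=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)  # (..., out_per_rank)
        # Gather every rank's output shard and concatenate along the
        # feature dimension to reconstruct the full output.
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out.contiguous())
        return torch.cat(shards, dim=-1)
```

A RowParallelLinear counterpart (input dimension sharded, partial outputs summed with all_reduce) would be needed as well, plus logic to load only each rank's slice of the checkpoint.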