meta-llama / llama

Inference code for Llama models

I am successfully running "llama-2-7b-chat" but have problems with "llama-2-13b-chat" and "llama-2-70b-chat" #834

Open vasili111 opened 1 year ago

vasili111 commented 1 year ago

My hardware

CPU and memory: 24 cores with 16GB memory per core.
GPU: 2 x NVIDIA A100 80GB

When I run nvidia-smi, this is the output:

Fri Sep 29 23:55:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:25:00.0 Off |                    0 |
| N/A   35C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:81:00.0 Off |                    0 |
| N/A   33C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+




example_chat_completion.py: The example_chat_completion.py file used below with all models is unmodified from the original repo: https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py




llama-2-7b-chat: Following the README.md, I successfully run the "llama-2-7b-chat" model with:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

With "llama-2-7b-chat" everything works well.




llama-2-13b-chat: Now I am trying to modify the command above to run "llama-2-13b-chat":

torchrun --nproc_per_node 2 example_chat_completion.py \
    --ckpt_dir llama-2-13b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

After running it, this is the output:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 14.85 seconds

and after that nothing happens.




llama-2-70b-chat: I am also trying to run "llama-2-70b-chat":

torchrun --nproc_per_node 8 example_chat_completion.py \
    --ckpt_dir llama-2-70b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

but I get the following error:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 8
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/llama/example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/llama/llama/generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(Each of the failing ranks prints the same traceback.)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 95468 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 95469 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 95470) of binary: /data/user/home/vbachi/llama_2/conda_env_v1/conda_env/bin/python
Traceback (most recent call last):
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/user/home/vbachi/llama_2/conda_env_v1/conda_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 95471)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 95472)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 95473)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 95474)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 95475)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-29_23:22:57
  host      : c0249.cm.cluster
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 95470)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================




Question: How can I correctly run "llama-2-13b-chat" and "llama-2-70b-chat"?

raghu-007 commented 1 year ago

It's a CUDA device issue.

Use the following code to set the CUDA device explicitly, e.g. to GPU 0:

import torch
torch.cuda.set_device(0)

Also check that your CUDA and cuDNN versions are compatible.
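
A minimal sketch of both checks (the versions printed will vary with your install; compare them against the toolkit the cluster provides):

import torch

# Pin this process to GPU 0, as suggested above.
torch.cuda.set_device(0)

# Versions PyTorch was built against, to compare with the cluster's CUDA/cuDNN modules.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPUs visible:", torch.cuda.device_count())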

HamidShojanazeri commented 1 year ago

@vasili111 A couple of notes here: I am not able to repro your 13B-chat issue. For the 70B model, torchrun --nproc_per_node 8 launches 8 processes, which requires 8 GPUs, and you have only two available; change it to 2, and with fp16 you might be able to pull it off.

vasili111 commented 1 year ago

@raghu-007 and @HamidShojanazeri

Thank you for your replies and help.

I checked the CUDA and PyTorch installation; PyTorch sees two GPUs and is able to run this code:

import torch
x = torch.rand(5, 3)
print(x)

I am using CUDA 11.8.0 and cuDNN 8.9.2.26; those are the versions recommended as compatible by the admins of the HPC system I am using. Please let me know if this could be an issue.
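
Note that torch.rand(5, 3) only exercises the CPU; a quick GPU-side check along the following lines (a minimal sketch) would confirm that both cards are actually usable from PyTorch:

import torch

# Run a small matmul on every GPU the process can see (should be cuda:0 and cuda:1 here).
for i in range(torch.cuda.device_count()):
    x = torch.rand(1024, 1024, device=f"cuda:{i}")
    y = x @ x
    torch.cuda.synchronize(i)
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}) OK, mean={y.mean().item():.4f}")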

It seems the problem with the 70B model is, as @HamidShojanazeri suggested, that only two GPUs are available while torchrun --nproc_per_node 8 asks for 8 GPUs.

But I am also unable to run the 13B model, even though I have 2 GPUs and request 2 processes with --nproc_per_node 2, like this:

torchrun --nproc_per_node 2 example_chat_completion.py \
    --ckpt_dir llama-2-13b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

The output I am getting:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 14.85 seconds

and nothing else happens. So here I have 2 GPUs and I am asking for 2 GPUs, but it still does not work. What could the problem be?

@HamidShojanazeri

with fp16 might be able t o pull it off

Could you please clarify what exactly fp16 means here? Sorry, I am new to DL/LLM/HPC.

EmanuelaBoros commented 1 year ago

For 13B you need 3 GPUs, and for the 70B, 8 GPUs.

vasili111 commented 1 year ago

@EmanuelaBoros For 13B, the documentation says it should be --nproc_per_node 2. That means 2 GPUs, right?

EmanuelaBoros commented 1 year ago

@vasili111 Maybe that is true for the checkpoints provided by Meta with the download script. I wish I could tell you more, but if you look here, the model is split into 3 parts. At least, that is how it works on my side.

My mistake for the 70B model - it might be 15 GPUs.

jjjunyeong commented 1 year ago

@vasili111 Having the same problem here. The 13b-chat model worked fine when I tested it on a server with 8 x 80GB A100 GPUs, while utilizing only 2 of them. But when I downloaded the model to a server with 2 x 80GB A100 GPUs, the 13b-chat model stopped working. The 7b-chat model still works fine.

HamidShojanazeri commented 1 year ago

As I mentioned, in this particular example 13B needs only 2 GPUs.
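
The reference code loads one consolidated checkpoint shard per model-parallel rank, so a quick way to see which --nproc_per_node a downloaded directory expects is to count its shards (a hedged sketch; expected_nproc is a hypothetical helper, not part of the repo; the Meta checkpoints ship 1 shard for 7B, 2 for 13B and 8 for 70B):

from pathlib import Path

def expected_nproc(ckpt_dir: str) -> int:
    # One consolidated.XX.pth file per model-parallel rank.
    return len(sorted(Path(ckpt_dir).glob("consolidated.*.pth")))

print(expected_nproc("llama-2-13b-chat/"))  # 2 for the Meta 13B checkpoint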

HamidShojanazeri commented 1 year ago


I am not able to repro this @vasili111, but it seems the model got loaded. Can you add more logging in the generate function to get more info?

With FP16, I meant half precision; in fact, the model is already in half precision.
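
To see where the 13B run stalls after "Loaded in 14.85 seconds", one option is to add a couple of timestamped prints around the generation call in example_chat_completion.py (a rough sketch, assuming the script still calls generator.chat_completion as in the current repo; adapt the names to your local copy):

import os
import time

# Inside main(), right after Llama.build(...) returns:
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun
t0 = time.time()
print(f"[rank {local_rank}] starting chat_completion", flush=True)

results = generator.chat_completion(
    dialogs,                       # the example dialogs defined in the script
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
)

print(f"[rank {local_rank}] chat_completion finished in {time.time() - t0:.1f}s", flush=True)

Running with NCCL_DEBUG=INFO (and, if needed, TORCH_DISTRIBUTED_DEBUG=DETAIL) may also show whether the two ranks are stuck in a collective.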

epignatelli commented 1 year ago

@HamidShojanazeri is the number of GPUs the exact requirement, or is it the amount of memory?

For example, one 80GB GPU would fit the 13B model, I assume, but it would not satisfy the 2-GPU requirement.

Would having only 1 GPU be a limiting case here? If so, is there a way around it?
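
Back-of-the-envelope, assuming fp16 weights at 2 bytes per parameter (a rough sketch that ignores the KV cache and activations):

# Approximate fp16 weight memory per model size.
for name, params in [("7b", 7e9), ("13b", 13e9), ("70b", 70e9)]:
    print(f"llama-2-{name}: ~{params * 2 / 1e9:.0f} GB of weights in fp16")
# llama-2-7b:  ~14 GB
# llama-2-13b: ~26 GB  (fits on one 80 GB A100, memory-wise)
# llama-2-70b: ~140 GB (does not fit on a single 80 GB A100)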

hjw1 commented 1 year ago

@EmanuelaBoros, thank you. It really helps!