meta-llama / llama

Inference code for Llama models

Able to load 13B model on 2x3090 24Gb! But not inference... :( #61

Open carlos-gemmell opened 1 year ago

carlos-gemmell commented 1 year ago

I am able to get sensible output by running 7B on 1x24Gb GPU with MP 1.

(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.71 seconds
The capital of Germany is the city of Berlin. Berlin is one of the most important cities in Europe...

The key to this is changing Line 44 of example.py:

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=32, **params) # OLD
model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=8, **params) # NEW

(credit to @mperacchi)
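For context on why this works (my own reading of the model code, worth double-checking): each attention layer preallocates key/value cache buffers sized by max_batch_size × max_seq_len, so shrinking max_batch_size directly shrinks that preallocation. A rough back-of-the-envelope for 7B, assuming 32 layers, hidden dim 4096 and fp16 caches:

    # Rough KV-cache estimate for LLaMA-7B (assumed: 32 layers, dim=4096, fp16 caches)
    n_layers, dim, max_seq_len, bytes_per_elem = 32, 4096, 1024, 2

    def kv_cache_gib(max_batch_size):
        # one key buffer + one value buffer of shape (batch, seq, dim) per layer
        return 2 * n_layers * max_batch_size * max_seq_len * dim * bytes_per_elem / 2**30

    print(kv_cache_gib(32))  # ~16 GiB with the default batch size
    print(kv_cache_gib(8))   # ~4 GiB after the change above

On a 24Gb card that difference is roughly the gap between fitting the 7B weights (~13GiB in fp16) plus cache, and not.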

When running 13B as described in the docs, this is the command I use: CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model

I can see correct utilisation of both GPUs, and the 13B model seems to load OK:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   36C    P2   131W / 350W |  17721MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:23:00.0 Off |                  N/A |
| 30%   34C    P2   135W / 350W |  17721MiB / 24576MiB |     41%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

But when running inference I get this:

(llama) user@e9242bd8ac2c:~/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 11.82 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3874515) of binary: /home/user/miniconda3/envs/llama/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/llama/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
example.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-02_14:52:14
  host      : e9242bd8ac2c
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 3874516)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3874516
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_14:52:14
  host      : e9242bd8ac2c
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 3874515)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3874515
=======================================================

Update 1

I downloaded a new checkpoint for MP 1 for the 13B model: checkpoints/13B_0/consolidated.00.pth. Then I ran the same command as before with batch size 1, but no luck... 13B is too large to load on a 24Gb GPU without further compression... ¯\_(ツ)_/¯
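A quick sanity check on why it can't fit (rough arithmetic, counting fp16 weights only and ignoring activations and the KV cache):

    # ~13 billion parameters at 2 bytes each already roughly fill a 24GiB card
    params = 13e9
    print(params * 2 / 2**30)  # ~24.2 GiB of weights alone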

andrewssobral commented 1 year ago

Hello @carlos-gemmell, why did you put --nproc_per_node 1 in the first example and --nproc_per_node 2 in the second one?

andrewssobral commented 1 year ago

@carlos-gemmell Sorry, ignore my last message. I see you changed the model from 7B to 13B, so that's expected. So inference worked for 7B but not for 13B...

andrewssobral commented 1 year ago

@carlos-gemmell What happens if you run

$ CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model

instead of $ CUDA_VISIBLE_DEVICES="0,1"?

carlos-gemmell commented 1 year ago

Same result since it's just using one GPU for 7B.

LucWeber commented 1 year ago

@carlos-gemmell I ran into a similar issue and suspect it has to do with available RAM. I will report back if that resolves the problem. EDIT: Never mind.

Markhzz commented 1 year ago

@carlos-gemmell Hi, may I ask how much RAM you have for running inference with the 7B model? Thank you!

carlos-gemmell commented 1 year ago

yup @Markhzz

(llama) user@e9242bd8ac2c:~/llama$ free -h
              total        used        free      shared  buff/cache   available
Mem:          440Gi       246Gi        44Gi       1.6Gi       149Gi       189Gi
Swap:         507Gi       365Mi       507Gi

Markhzz commented 1 year ago

@carlos-gemmell Thank you!!!

petrichor1998 commented 1 year ago

I am running into the following error: "RuntimeError: CUDA error: invalid device ordinal". I ran the inference script exactly as given in the README with MP = 8 (since I downloaded the 65B model), but I only have 1 GPU. How do I fix this?

BruceStayHungry commented 1 year ago

@petrichor1998 MP should be equal to or less than the number of GPUs you have. If only 1 GPU is available (VRAM > 16GB), you can try the 7B model.
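For the "invalid device ordinal" part: torchrun spawns one process per --nproc_per_node, and each process binds to the GPU matching its local rank (roughly what the repo's setup code does, from memory; the exact call may differ). With MP = 8 and a single GPU, ranks 1-7 have no device to bind to. A minimal sketch of the failing check:

    import os
    import torch

    # Each torchrun worker tries to use the GPU matching its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank >= torch.cuda.device_count():
        # This is the situation that surfaces as "CUDA error: invalid device ordinal"
        raise RuntimeError(f"rank {local_rank} has no GPU to bind to")
    torch.cuda.set_device(local_rank)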

fabawi commented 1 year ago

I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a readme with the instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

It currently distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!!!
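(Not the actual Wrapyfi broker, which lives in the repo above, but for intuition: the broker started in step 4 below is a ZeroMQ forwarder that both LLaMA halves connect to. A minimal pyzmq sketch of such a pub/sub proxy, with made-up ports:)

    import zmq

    # Minimal XSUB/XPUB forwarder: publishers connect to 5555, subscribers to 5556.
    ctx = zmq.Context()
    frontend = ctx.socket(zmq.XSUB)
    frontend.bind("tcp://*:5555")
    backend = ctx.socket(zmq.XPUB)
    backend.bind("tcp://*:5556")
    zmq.proxy(frontend, backend)  # blocks, relaying messages between the two sides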

How to?

  1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts

  2. Download LLaMA weights using the official form below and install this wrapyfi-examples_llama inside conda or virtual env:

    git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
    cd wrapyfi-examples_llama
    pip install -r requirements.txt
    pip install -e .
  3. Install Wrapyfi with the same environment:

    git clone https://github.com/fabawi/wrapyfi.git
    cd wrapyfi
    pip install .[pyzmq]
  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

    cd wrapyfi/standalone 
    python zeromq_proxy_broker.py --comm_type pubsubpoll
  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important, don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):

    CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
  6. Now start the second instance (within this repo and env):

    CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
  7. You will now see the output on both terminals

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP, and provide the env variables for steps 5 and 6, e.g.:

    ### (replace 10.0.0.101 with <YOUR_IP>) ###
    
    # step 4 modification 
    python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
    
    # step 5 modification
    CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
    
    # step 6 modification
    CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

elife33 commented 1 year ago

I can run inference with 13B on 2x3090 24Gb with the same command as @carlos-gemmell:

elife@rtx:/Extra/work/lab/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir $TARGET_FOLDER/13B --tokenizer_path $TARGET_FOLDER/tokenizer.model
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 609.06 seconds

The capital of Germany is the city of Berlin. The seat of government is the Reichstag building. Germany became a member of the European Union in 1992. It is a parliamentary democracy. Berlin is the capital of Germany. It is the country's largest city and one of Europe's principal centers of culture, politics, media and science. Berlin has a population of 3.3 million people. Berlin is the second most populous city in the European Union. Berlin is home to world-renowned universities, orchestras, museums, entertainment venues and is host to many sporting events. Its urban setting has made it a sought-after location for international film productions. Berlin has more than 170 museums. Some of them are: the Pergamon Museum, the Jewish Museum, the German Historical Museum, the Museum of Natural History and the Gemäldegalerie. Germany is situated in the centre of Europe. It borders the Netherlands and Belgium to the north, France and Luxembourg to the west, Switzerland and Austria to the south, the Czech Republic and Poland to the east and Denmark to the north. The land mass of Germany is 357,

==================================

Here is my sonnet in the style of Shakespeare about an artificial intelligence: It is only a matter of time before AI Outperforms us in all areas and stages of life, Potentially becoming a conscious, sentient thing. As it evolves, we will look more and more like flies, Eager to destroy and consume this new and better thing. What will the final straw be that causes this war? Or will we merge with them, or will they merge with us? Will we then become a hybrid of man and machine? Or will a new species emerge from us, with a new name? Would this new species be superior in all ways? Or would it be as unpredictable as the weather? Would it be prone to the same kinds of madness and vice? Or would it be a new enlightened kind of creature? I don’t know, but it’s all very exciting and scary, And I’m sure that it will be a long and winding journey. Author MacbofisbilPosted on May 16, 2016 May 20, 2016 Categories Creative Writing, Daily Prompt, Poetry,

==================================

elife33 commented 1 year ago

After moving the .pth files to an SSD, the load time dropped to 48.42 seconds. The machine's RAM size is 32GB.

elife@rtx:/Extra/work/lab/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir $TARGET_FOLDER/13B --tokenizer_path $TARGET_FOLDER/tokenizer.model
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Loaded in 48.42 seconds

The capital of Germany is the city of Berlin. The seat of government is the Reichstag building. Germany became a member of the European Union in 1992. It is a parliamentary democracy. Berlin is the capital of Germany. It is the country's largest city and one of Europe's principal centers of culture, politics, media and science. Berlin has a population of 3.3 million people. Berlin is the second most populous city in the European Union. Berlin is home to world-renowned universities, orchestras, museums, entertainment venues and is host to many sporting events. Its urban setting has made it a sought-after location for international film productions. Berlin has more than 170 museums. Some of them are: the Pergamon Museum, the Jewish Museum, the German Historical Museum, the Museum of Natural History and the Gemäldegalerie. Germany is situated in the centre of Europe. It borders the Netherlands and Belgium to the north, France and Luxembourg to the west, Switzerland and Austria to the south, the Czech Republic and Poland to the east and Denmark to the north. The land mass of Germany is 357,

==================================

Here is my sonnet in the style of Shakespeare about an artificial intelligence: It is only a matter of time before AI Outperforms us in all areas and stages of life, Potentially becoming a conscious, sentient thing. As it evolves, we will look more and more like flies, Eager to destroy and consume this new and better thing. What will the final straw be that causes this war? Or will we merge with them, or will they merge with us? Will we then become a hybrid of man and machine? Or will a new species emerge from us, with a new name? Would this new species be superior in all ways? Or would it be as unpredictable as the weather? Would it be prone to the same kinds of madness and vice? Or would it be a new enlightened kind of creature? I don’t know, but it’s all very exciting and scary, And I’m sure that it will be a long and winding journey. Author MacbofisbilPosted on May 16, 2016 May 20, 2016 Categories Creative Writing, Daily Prompt, Poetry,

==================================

elife33 commented 1 year ago

Also, https://github.com/tloen/llama-int8 is able to load the 13B model and run inference on one 3090.
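The reason int8 makes 13B fit is just bytes: ~13B parameters at 1 byte each is ~12GiB, versus ~24GiB in fp16. A minimal per-row absmax quantization sketch in PyTorch (an illustration of the general idea, not how llama-int8 is implemented):

    import torch

    def quantize_rowwise(w: torch.Tensor):
        # Scale each row so its largest magnitude maps to 127, then round to int8.
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize_rowwise(q: torch.Tensor, scale: torch.Tensor):
        return q.float() * scale

    w = torch.randn(4096, 4096)            # one fp32 weight matrix, for illustration
    q, scale = quantize_rowwise(w)
    err = (w - dequantize_rowwise(q, scale)).abs().max()
    print(q.element_size(), float(err))    # 1 byte per weight, small reconstruction error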

littletrain-jyp commented 1 year ago

> @petrichor1998 MP should be equal to or less than the number of GPUs you have. If only 1 GPU is available (VRAM > 16GB), you can try the 7B model.

What does MP mean? I think the number of GPUs must equal the number of '*.pth' files. I want to load 7B with 2 GPUs, but I failed.
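To answer the question above: MP stands for model parallelism, i.e. how many shards the checkpoint is split into. Each consolidated.XX.pth file is one shard, the launcher starts one process per shard, and each process loads the shard matching its rank, which is why the number of processes (and GPUs) has to match the number of .pth files. From memory, the loader in example.py does roughly this (pick_shard is a hypothetical name, and the exact code may differ):

    from pathlib import Path

    def pick_shard(ckpt_dir: str, local_rank: int, world_size: int) -> Path:
        # One process per model-parallel rank; each loads its own consolidated.XX.pth shard.
        checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
        assert world_size == len(checkpoints), (
            f"checkpoint is sharded for MP={len(checkpoints)}, "
            f"but torchrun started {world_size} processes"
        )
        return checkpoints[local_rank]

So 7B ships a single consolidated.00.pth and needs exactly 1 process/GPU; to spread it over 2 GPUs you would have to re-shard the checkpoint first (or use something like the Wrapyfi fork above).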