carlos-gemmell opened this issue 1 year ago
Hello @carlos-gemmell ,
Why did you put --nproc_per_node 1 in the first example and --nproc_per_node 2 in the second one?
@carlos-gemmell sorry, ignore my last message; I see you changed the model from 7B to 13B, so that's expected. So the inference worked for 7B but not for 13B...
@carlos-gemmell what happens if you run
$ CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node 1 example.py --ckpt_dir checkpoints/7B --tokenizer_path checkpoints/tokenizer.model
instead of $ CUDA_VISIBLE_DEVICES="0,1"?
Same result since it's just using one GPU for 7B.
@carlos-gemmell I ran into a similar issue and suspect it has to do with available RAM. I will report back if it helps me resolve the problem. EDIT: Nevermind.
@carlos-gemmell Hi, may I ask how much RAM you have on the machine running inference with the 7B model? Thank you!
yup @Markhzz
(llama) user@e9242bd8ac2c:~/llama$ free -h
               total        used        free      shared  buff/cache   available
Mem:           440Gi       246Gi        44Gi       1.6Gi       149Gi       189Gi
Swap:          507Gi       365Mi       507Gi
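For reference, you can do the same sanity check of total and available system RAM from Python, without shelling out to free. A Linux-only sketch using sysconf (the GiB conversion and variable names are just for illustration):

```python
import os

# Mirror what `free -h` reports: total and available physical memory.
# SC_AVPHYS_PAGES is Linux-specific; this won't work on macOS/Windows.
page = os.sysconf("SC_PAGE_SIZE")
total_gib = os.sysconf("SC_PHYS_PAGES") * page / 2**30
avail_gib = os.sysconf("SC_AVPHYS_PAGES") * page / 2**30
print(f"total: {total_gib:.1f} GiB, available: {avail_gib:.1f} GiB")
```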
@carlos-gemmell Thank you!!!
I am running into the following error: "RuntimeError: CUDA error: invalid device ordinal". I ran the inference script exactly as given in the README with MP = 8 (since I downloaded the 65B model), but I only have 1 GPU. How do I fix this?
@petrichor1998 MP should be equal to or less than the number of GPUs you have. If only 1 GPU is available (VRAM > 16 GB), you can try the 7B model.
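The reason MP is tied to GPU count: each official checkpoint is sharded into consolidated.*.pth files (7B: 1, 13B: 2, 33B: 4, 65B: 8), and torchrun spawns one process per shard. A hedged sketch of how you could infer the required MP from a checkpoint directory (the helper name required_mp is made up for illustration):

```python
from pathlib import Path

def required_mp(ckpt_dir):
    # The model-parallel size must match the number of
    # consolidated.*.pth shards the checkpoint was split into.
    return len(sorted(Path(ckpt_dir).glob("consolidated.*.pth")))
```

E.g., pointing this at checkpoints/13B should return 2, which is why the 13B examples use --nproc_per_node 2.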
I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:
https://github.com/modular-ml/wrapyfi-examples_llama
and have a readme with the instructions on how to do it:
Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16 GB VRAM. It currently distributes over two cards only, using ZeroMQ; flexible distribution is coming soon!
This approach has only been tested on the 7B model so far, on Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!
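Conceptually, the split works like this: the first process runs the lower half of the model's layers and publishes the intermediate activations over ZeroMQ; the second process subscribes, runs the upper half, and produces the output. A toy sketch in plain Python (no torch or Wrapyfi involved; a Queue stands in for the ZeroMQ channel, and the "layers" are stand-in scaling functions):

```python
from queue import Queue

def make_layer(scale):
    # Stand-in for a transformer block: just scales its input.
    return lambda x: [v * scale for v in x]

layers = [make_layer(s) for s in (2, 3, 4, 5)]
first_half, second_half = layers[:2], layers[2:]
channel = Queue()  # stands in for the ZeroMQ pub/sub link

def run_first(x):
    # Process with wrapyfi_device_idx=1: runs the lower layers,
    # then "publishes" the intermediate activations.
    for layer in first_half:
        x = layer(x)
    channel.put(x)

def run_second():
    # Process with wrapyfi_device_idx=0: "subscribes" to the
    # activations and runs the remaining layers.
    x = channel.get()
    for layer in second_half:
        x = layer(x)
    return x

run_first([1.0])
result = run_second()  # 1 * 2 * 3 * 4 * 5
```

This is why each process only needs to hold half the weights in VRAM.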
1. Download the LLaMA weights using the official form and install this wrapyfi-examples_llama inside a conda or virtual env:
git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .
2. Replace all instances of <YOUR CHECKPOINT DIRECTORY> (and, for multi-machine runs, <YOUR_IP>) in the commands below.
3. Install Wrapyfi in the same environment:
git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]
4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order matters: don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
6. Now start the second instance (within this repo and env):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
You should now see the output on both terminals.
EXTRA: To run on different machines, the broker in step 4 must listen on a reachable IP. Start the ZeroMQ broker with the IP set, and pass it as an env variable in steps 5 and 6, e.g.:
### replace 10.0.0.101 with <YOUR_IP> ###
# step 4 modification
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
I can run inference with 13B on 2x 3090 24 GB with the same command as @carlos-gemmell:
elife@rtx:/Extra/work/lab/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir $TARGET_FOLDER/13B --tokenizer_path $TARGET_FOLDER/tokenizer.model
WARNING:torch.distributed.run: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loaded in 609.06 seconds
The capital of Germany is the city of Berlin. The seat of government is the Reichstag building. Germany became a member of the European Union in 1992. It is a parliamentary democracy. Berlin is the capital of Germany. It is the country's largest city and one of Europe's principal centers of culture, politics, media and science. Berlin has a population of 3.3 million people. Berlin is the second most populous city in the European Union. Berlin is home to world-renowned universities, orchestras, museums, entertainment venues and is host to many sporting events. Its urban setting has made it a sought-after location for international film productions. Berlin has more than 170 museums. Some of them are: the Pergamon Museum, the Jewish Museum, the German Historical Museum, the Museum of Natural History and the Gemäldegalerie. Germany is situated in the centre of Europe. It borders the Netherlands and Belgium to the north, France and Luxembourg to the west, Switzerland and Austria to the south, the Czech Republic and Poland to the east and Denmark to the north. The land mass of Germany is 357,
==================================
Here is my sonnet in the style of Shakespeare about an artificial intelligence: It is only a matter of time before AI Outperforms us in all areas and stages of life, Potentially becoming a conscious, sentient thing. As it evolves, we will look more and more like flies, Eager to destroy and consume this new and better thing. What will the final straw be that causes this war? Or will we merge with them, or will they merge with us? Will we then become a hybrid of man and machine? Or will a new species emerge from us, with a new name? Would this new species be superior in all ways? Or would it be as unpredictable as the weather? Would it be prone to the same kinds of madness and vice? Or would it be a new enlightened kind of creature? I don’t know, but it’s all very exciting and scary, And I’m sure that it will be a long and winding journey. Author MacbofisbilPosted on May 16, 2016 May 20, 2016 Categories Creative Writing, Daily Prompt, Poetry,
==================================
After moving the .pth files to an SSD, the load time dropped to 48.42 seconds (the machine's RAM is 32 GB):
elife@rtx:/Extra/work/lab/llama$ CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir $TARGET_FOLDER/13B --tokenizer_path $TARGET_FOLDER/tokenizer.model
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loaded in 48.42 seconds
(generated output identical to the run above)
Also, https://github.com/tloen/llama-int8 is able to load the 13B model and run inference on a single 3090.
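The int8 trick works because quantizing the weights to 8 bits roughly halves the fp16 memory footprint. A minimal illustration of absmax int8 quantization (a simplification of what int8 loaders do, in plain Python; not the actual library code):

```python
def quantize_int8(xs):
    # absmax quantization: scale so the largest |x| maps to 127,
    # then round each value to the nearest int8.
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    # Recover approximate fp values from the int8 codes.
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored in 1 byte plus a shared scale, at the cost of a small rounding error per value.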
@petrichor1998 MP should be equal to or less than the number of GPUs you have. If only 1 GPU is available (VRAM > 16 GB), you can try the 7B model.
What does MP mean? I think the number of GPUs must equal the number of '*.pth' shards. I wanted to load 7B with 2 GPUs, but it failed.
I am able to get sensible output by running 7B on 1x 24 GB GPU with MP 1. The key to this is changing line 44 of example.py (credit to @mperacchi).
When running 13B as stated in the docs this is the command I use:
CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node 2 example.py --ckpt_dir checkpoints/13B --tokenizer_path checkpoints/tokenizer.model
I can see correct utilisation of the GPUs, and the 13B model seems to load OK. But when running inference I get this:
Update 1
I downloaded a new checkpoint with MP 1 for the 13B model: checkpoints/13B_0/consolidated.00.pth. Then I ran the same command as before with batch size one, but no luck... 13B is too large to load on a 24 GB GPU without further compression ¯\_(ツ)_/¯
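Back-of-envelope arithmetic supports that conclusion: 13B parameters at 2 bytes each (fp16) is about 24.2 GiB for the weights alone, before the CUDA context, activations, and KV cache, so a 24 GB card can't hold it uncompressed:

```python
# Rough memory estimate for the 13B model's weights
# (weights only; runtime overhead comes on top of this).
params = 13e9
fp16_gib = params * 2 / 2**30  # 2 bytes per param in fp16
int8_gib = params * 1 / 2**30  # 1 byte per param after int8 quantization
print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

which is why the int8 route above fits on a single 3090 while plain fp16 does not.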