johnsmith0031 / alpaca_lora_4bit

MIT License

Finetune Llama 65B with dual RTX 3090 #91

Open juanps90 opened 1 year ago

juanps90 commented 1 year ago

Hello, and thank you very much for all your work.

I was wondering if it is possible to split the VRAM requirements between two RTX 3090s to finetune, and later run, the LLaMA 65B model in a way that doesn't OOM?

johnsmith0031 commented 1 year ago

Use this: load_llama_model_4bit_low_ram_and_offload. Set max_memory = {0: '18GiB', 1: '18GiB', 'cpu': '48GiB'}
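For illustration, a minimal sketch of what that call might look like (the paths, the positional arguments, and the returned objects are assumptions; check the function's definition in autograd_4bit.py in your checkout):

```python
# Minimal sketch, not the repo's documented usage: paths are placeholders and the
# exact signature / return values may differ in your version of autograd_4bit.py.
from autograd_4bit import load_llama_model_4bit_low_ram_and_offload

model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    './llama-65b-4bit/',             # hypothetical HF config directory
    './llama-65b-4bit.safetensors',  # hypothetical 4-bit checkpoint
    groupsize=-1,
    is_v1_model=False,
    # Capping each 3090 well below its 24 GB makes the loader spread layers across
    # both GPUs and offload whatever is left to CPU RAM.
    max_memory={0: '18GiB', 1: '18GiB', 'cpu': '48GiB'},
)
```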

ehartford commented 1 year ago

I am trying the same thing; I just got my dual 3090s installed.

wesleysanjose commented 1 year ago

@ehartford Curious, are both cards connected via NVLink? If so, what motherboard do you use? I just got a 3090 and am considering getting another one for an NVLink setup.

ehartford commented 1 year ago

Yeah, I have dual 3090s with NVLink. I have a dual Xeon motherboard (https://www.asrockrack.com/general/productdetail.asp?Model=EP2C612%20WS#Specifications). The problem is that it's designed for 2-slot spacing, so I had to use risers.

juanps90 commented 1 year ago

Yeah, I have dual 3090s with NVLink. I have a dual Xeon motherboard (https://www.asrockrack.com/general/productdetail.asp?Model=EP2C612%20WS#Specifications). The problem is that it's designed for 2-slot spacing, so I had to use risers.

Daaaamn, 7 PCIe slots. You could totally set up a 4-way 3090 rig.

ehartford commented 1 year ago

With some kind of external power supplies and enclosures yeah that would be awesome 😎

wesleysanjose commented 1 year ago

Yeah, I have dual 3090s with NVLink. I have a dual Xeon motherboard (https://www.asrockrack.com/general/productdetail.asp?Model=EP2C612%20WS#Specifications). The problem is that it's designed for 2-slot spacing, so I had to use risers.

This looks awesome. What does it look like when you run a finetune? I just bought a 3090 from Craigslist today, and it can run a 4-bit finetune with batch 64 and micro-batch 8, consuming 23GB of VRAM, which shortens the Alpaca dataset finetune to within 12 hours for 1 epoch.

I am thinking of getting another 3090 with NVLink if it increases speed further.

tensiondriven commented 1 year ago

I'm not sure how much benefit you'll get from another 3090; depending on your financial situation, I don't think it's worth it. You might want another 3090 if you want to do inference and training at the same time, or run multiple models, or models with different LoRAs for simulation / multi-agent setups.

Note that when you train on two GPUs, only one GPU does work at a time (at least that's how it is on my setup). And if you want it to be fast, it's important that both cards be connected at PCIe Gen 4 x16.
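If you want to check that yourself, polling nvidia-smi during a run makes the pattern visible (standard nvidia-smi query flags, nothing specific to this repo):

```python
# Print per-GPU utilization and memory once a second while training runs.
# If the two GPUs alternate between ~0% and high utilization, the layers are
# being executed sequentially (naive model parallelism), not in parallel.
import subprocess
import time

while True:
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used',
        '--format=csv,noheader,nounits',
    ]).decode().strip()
    print(out, '\n---')
    time.sleep(1)
```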

tensiondriven commented 1 year ago

@ehartford Post pics of your rig somewhere, if you can :) I've also got dual 3090s, but the slot spacing is off.

ehartford commented 1 year ago

[two photos of the dual-3090 build]

tensiondriven commented 1 year ago

[two photos of the cooling setup]

I 3D printed a custom vent cover for the back, and am using an 80mm x 38mm PWM (I think?) fan, connected to the motherboard. Using a motherboard temp reading, the fan curves are set such that these actually stay surprisingly cool. This type of fan is hella powerful, and I never even have to run it at full power.

I'm pretty proud of the cooling solution. The problem is the mobo only supports Gen3x4 on the second card.

Oh, I also added a custom RGB matrix panel on the inside, facing down at the cards, and wrote a script on the Proxmox host that uses OpenRGB to make the LEDs brighter when the cards are under load.

[photo of the RGB matrix panel]

You know, just to reduce airflow and add a little more heat to the mix.

wesleysanjose commented 1 year ago

I am curious: with or without NVLink, what's the difference when finetuning Alpaca? It now takes 12 hours for my single 3090 to complete one epoch with batch 64 and micro-batch 8, which is 5 times faster than my original 2060S, which could only run batch 2 and micro-batch 1.

tpfwrz commented 1 year ago

Yeah, I have dual 3090s with NVLink. I have a dual Xeon motherboard (https://www.asrockrack.com/general/productdetail.asp?Model=EP2C612%20WS#Specifications). The problem is that it's designed for 2-slot spacing, so I had to use risers.

@ehartford what risers are you using if you don't mind me asking?

3dluvr commented 1 year ago

How do these risers help though - wouldn't you need a special/specific case to mount the cards in vertical/upright orientation, or something similar?

I have an Obsidian 900D case with water-cooled everything, and am having a problem fitting another 3090 because, to satisfy 4-slot spacing, I would need to use the last PCIe slot. When a card with a waterblock (or the stock HSF) sits in that last slot, it covers all of the peripheral ports on my motherboard (Supermicro X10DAi). Talk about motherboard designers not having foresight for dual graphics card setups.

shawei3000 commented 1 year ago

I am trying to finetune the 65B 4-bit LLaMA model (with two 48GB GPUs, no NVLink), following the advice from johnsmith0031:

Use this: load_llama_model_4bit_low_ram_and_offload with groupsize=-1, is_v1_model=False, max_memory = {0: '43GiB', 1: '43GiB', 'cpu': '48GiB'}

I have a 7k training dataset. Finetuning works, but I must truncate the input tokens to 256, otherwise CUDA runs out of memory. However, checking nvidia-smi, it shows that the 2nd GPU is never used at all. Any reason why? And do I really need NVLink to have both GPUs in use for finetuning?

johnsmith0031 commented 1 year ago

Try setting max_memory = {0: '18GiB', 1: '18GiB', 'cpu': '48GiB'}
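The likely reason the second GPU sat idle before: with a 43GiB cap per card, the whole 4-bit 65B model fits inside GPU 0's budget, so the automatic device map never assigns anything to GPU 1; tighter per-GPU caps force the layers to spill over. Below is a rough sketch of how a max_memory budget turns into a placement, assuming the loader leans on accelerate's device-map inference (an assumption about the internals, not something stated in this thread):

```python
# Illustrative only: shows how accelerate turns a max_memory budget into a device map.
# The loader in this repo may differ, and sizes here come from the empty meta weights
# (fp16/fp32), so the exact split won't match a 4-bit checkpoint.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('./llama-65b-4bit/')  # hypothetical config directory
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)      # no memory is allocated here

# Generous per-GPU budgets let everything land on device 0; tighter budgets push
# layers onto device 1 and then onto the CPU.
device_map = infer_auto_device_map(
    model, max_memory={0: '18GiB', 1: '18GiB', 'cpu': '48GiB'}
)
print(device_map)
```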

shawei3000 commented 1 year ago

@johnsmith0031, I can NOT thank you enough for your suggestion above; it works! With 2 GPUs at 48GB each, I can finetune a LoRA with a max of 768 input tokens... No NVLink/bridge is actually needed, and training speed is identical to single-GPU training with 256 tokens... Thanks, John!!!!