turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Unable to split across multiple AMD GPUs #208

Closed: TNT3530 closed this issue 1 year ago

TNT3530 commented 1 year ago

When attempting to `-gs` across multiple Instinct MI100s, the model is loaded into VRAM as specified, but loading never completes. The model appears to load, and then the second GPU in the sequence sits at 100% load indefinitely, regardless of model size or GPU split. Even with 4 cards, only the second one shows any usage, and the benchmark never starts the warmup pass.

I've tried both LLaMA 2 and Pyg 13 for models, different `-gs` splits, and spreading across more cards; nothing seems to work.

| GPU | Temp (DieEdge) | AvgPwr | SCLK | MCLK | Fan | Perf | PwrCap | VRAM% | GPU% |
|-----|----------------|--------|------|------|-----|------|--------|-------|------|
| 0 | 43.0c | 42.0W | 300Mhz | 1200Mhz | 0% | auto | 290.0W | 25% | 0% |
| 1 | 62.0c | 111.0W | 1502Mhz | 1200Mhz | 0% | auto | 290.0W | 25% | 100% |
| 2 | 50.0c | 39.0W | 300Mhz | 1200Mhz | 0% | auto | 290.0W | 25% | 0% |
| 3 | 49.0c | 43.0W | 300Mhz | 1200Mhz | 0% | auto | 290.0W | 27% | 0% |
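(For context, the `-gs` split described above can also be set up through exllama's Python loader. The sketch below is illustrative only: the model directory is a placeholder, the per-device allocations stand in for a hypothetical four-card split, and it assumes the `ExLlamaConfig.set_auto_map` helper that the repo's example scripts use to apply `-gs`.)

```python
# Illustrative sketch: load a quantized model with a manual GPU split,
# roughly what `-gs 16,16,16,16` does in the example/benchmark scripts.
# Paths and allocation sizes are placeholders.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer

model_dir = "/models/llama2-13b-gptq"                 # hypothetical directory

config = ExLlamaConfig(f"{model_dir}/config.json")    # HF config for the model
config.model_path = f"{model_dir}/model.safetensors"  # quantized weights

# GiB of weights to place on each visible device, in order (assumed helper
# mirroring the -gs command-line flag).
config.set_auto_map("16,16,16,16")

model = ExLlama(config)   # weights are loaded and spread across GPUs here
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
```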

jmoney7823956789378 commented 1 year ago

Never seen the MI100 before, also never seen this issue pop up with my MI60s. I'm assuming you followed the rentry guide for AMD?

TNT3530 commented 1 year ago

> Never seen the MI100 before, also never seen this issue pop up with my MI60s. I'm assuming you followed the rentry guide for AMD?

I didn't follow an exact guide, I installed it myself. Inference works fine, albeit very slow for the rated specs; it's just splitting that doesn't work. It also doesn't seem to unload from RAM once loaded. It's possible Proxmox is just misreporting usage.
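(As a general diagnostic, not something from this thread: one quick way to confirm that all of the Instinct cards are actually visible to the ROCm build of PyTorch, which exposes HIP devices through the `torch.cuda` namespace, is a check like the following.)

```python
# Sanity check: list the HIP devices PyTorch can see under ROCm.
import torch

print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  device {i}: {torch.cuda.get_device_name(i)}")
```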

jmoney7823956789378 commented 1 year ago

Yeah, poor perf is unfortunately common with AMD/ROCm right now. One of us needs to get an MI card into turbo's hands eventually.

TNT3530 commented 1 year ago

Fixed by