turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Multi-GPU issues #281

Open nktice opened 1 year ago

nktice commented 1 year ago

Here's another unresolved bug on Oobabooga's project: https://github.com/oobabooga/text-generation-webui/issues/2923 . I realized the ExLlama team may have a solution, so I thought I'd cross-post the issue on this project in case you haven't seen it.

Here's the guide I wrote to get everything working on AMD hardware: https://github.com/nktice/AMD-AI . Models load fine when running on a single card; here are some results: https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md

Multi-card loading only spits out gibberish; here's an example:

```
pha golden Riv. Jcatred (ProcSN proc Dre -:// Mindly means for the and in a Nich říct Forest Rav Rav fran fran fran gaz Agrcastle castleasiacliordinate advers Mem advers Basibenkooor paste Singapore refugeermeanny intellectualsafe Shakespe contempor Mallmanual Quantmousektr Ge Mil shadownehfdzekADmobile Und Euenf Next Dominbuchcock Infoengo Hann NAT ]] Ferr' -.-- -,-
ason, rang,-, –-
(,,
--,.,
alter
,-
(
-on,-.
I,- .
1
V
V. film-
N
–on.,on,.
(, for.
and of- is. . and –on, –,. and
In in
film school and I on and with and I ":
.
` andon util –
```
Ph0rk0z commented 1 year ago

Bug in HIP or ROCm. On NVIDIA, splitting works. The other bug is an OOM if you can't dispatch the model properly so that it doesn't run out of memory during inference.

nktice commented 1 year ago

> Bug in HIP or ROCm. On NVIDIA, splitting works. The other bug is an OOM if you can't dispatch the model properly so that it doesn't run out of memory during inference.

Thanks for your reply... I've raised the issue on HIP's GitHub issue tracker: https://github.com/ROCm-Developer-Tools/HIP/issues/3331

turboderp commented 1 year ago

Just in case you haven't tried it yet, the --gpu_peer_fix argument (corresponding entry in ExLlamaConfig) might help. Maybe? It prevents direct inter-device copying even when the driver reports that the capability is there, and copies everything via system RAM instead. There have been some issues with that on NVIDIA at least.
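
For reference, here is a minimal sketch of setting that option when driving the library directly, assuming the `ExLlamaConfig` attributes mirror the command-line flags; the paths and the GiB split below are placeholders, not values from this thread:

```python
# Hedged sketch: load a model split across two GPUs with the peer fix enabled.
# Paths and the "10,14" GiB split are placeholders for illustration only.
from model import ExLlama, ExLlamaConfig
from tokenizer import ExLlamaTokenizer

config = ExLlamaConfig("/models/llama-13b-gptq/config.json")    # placeholder path
config.model_path = "/models/llama-13b-gptq/model.safetensors"  # placeholder path
config.set_auto_map("10,14")   # VRAM (GiB) to allocate on GPU 0 and GPU 1
config.gpu_peer_fix = True     # route inter-device copies through system RAM

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/models/llama-13b-gptq/tokenizer.model")  # placeholder path
```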

nktice commented 1 year ago

Thanks for your reply, and your excellent coding, it's great when it works...

I looked into this and had trouble finding how to do such a thing... Since it would be good to have these features as options in their interface, I have requested that Oobabooga add them: https://github.com/oobabooga/text-generation-webui/issues/3912

I have been looking for (but have yet to find again) a page I came across that discussed similar issues arising from torch.empty, since it does not clear the underlying data, and suggested using torch.zeros instead, which helped some people. I went through your code and tried that for my issue, to little avail, but thought I'd mention it in case you haven't heard of it and it helps others. [ If I find that page again, I'll update this with a proper link. ]

turboderp commented 1 year ago

Yep, torch.empty isn't supposed to clear the data, which could cause problems if you're incorrectly assuming that an empty tensor is the same as a zeros tensor, but I think I've been mindful enough of the distinction.
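
As a minimal illustration of that distinction (generic PyTorch, not exllama code): `torch.empty` returns uninitialized memory, so it is only safe as a write-before-read output buffer, while `torch.zeros` guarantees a cleared buffer.

```python
import torch

a = torch.empty(4)   # uninitialized: contents are whatever was left in that memory
b = torch.zeros(4)   # guaranteed to contain 0.0

# Safe use of an "empty" tensor: write into it before ever reading from it.
torch.add(torch.ones(4), torch.ones(4), out=a)   # a is now [2., 2., 2., 2.]
```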

--gpu_peer_fix is only a kludge to work around a particular bug in Torch (or CUDA, or the NVIDIA driver, or whatever the case may be). So it's not really a solution or anything, more a diagnostic tool, and the solution would be filing a bug report upstream if that flag fixes something that shouldn't be broken.

I'm thinking another thing to explore would be the use of at::cuda::OptionalCUDAGuard to ensure that the correct CUDA device is selected on entry to each of the extension functions. If that doesn't get properly HIPified, it could lead to ROCm working correctly on single-GPU setups but failing (perhaps even sporadically) on multi-GPU setups.

nktice commented 1 year ago

I got a reply on the Oobabooga issue about passing parameters such as the one you suggest: "It's on by default." https://github.com/oobabooga/text-generation-webui/issues/3912#issuecomment-1719362059

Thanks for your replies... I have been thinking about something related, so I'll mention it. Another issue, somewhat related to model loading, is the cache and other memory involved in handling models. For example, forum commenters have noted that split settings should leave plenty of room beyond the weights, since tokens, cache, and other bookkeeping consume a lot of space [ the bigger the model, the more is used for the cache and index info... ]. Is there a loader option that reports these amounts [ a command-line option, a benchmark tool parameter, or something like that... ] so one can predict the whole memory footprint a model will use?
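
For a rough sense of the cache component, a back-of-envelope estimate can be made from the model dimensions. This is only a sketch, not an existing loader option; the Llama-7B-style numbers are illustrative assumptions:

```python
# Rough KV-cache size per sequence: 2 (K and V) * layers * heads * head_dim
# * seq_len * bytes per element. Defaults assume a Llama-7B-like shape with an
# FP16 cache; substitute the dimensions of the model actually being loaded.
def kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                   seq_len=2048, bytes_per_elem=2):
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

print(kv_cache_bytes() / 1024**3, "GiB")   # ~1.0 GiB at 2048 tokens
```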

Related to this: instead of splitting the model across GPUs, is it practical to put the supporting data on another card? That would allow maximizing the model size loaded on one card, avoiding an issue like the one I'm having with splitting the model, while using the second card (or perhaps system RAM) for the cache / tokens.
As token counts go up, that supplemental memory could become more helpful.

turboderp commented 1 year ago

Cache and state have to reside on the same device as the associated weights. You can't do CUDA operations across devices, and while you could store just the cache on a separate device, it would be slower than just swapping it to system RAM, which is still slow enough to be kind of useless.
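
A quick way to see the cross-device restriction in plain PyTorch (requires two GPUs; illustrative only):

```python
import torch

a = torch.randn(2, device="cuda:0")
b = torch.randn(2, device="cuda:1")
c = a + b  # raises RuntimeError: tensors must be on the same device
```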

ardfork commented 1 year ago

I guess I forgot to answer here; this is the same issue as #173, which was fixed upstream and will be available in the next ROCm version.

Note that exllama v2 is also affected. This could have been fixed locally in exllama with a small hack, as was done in llama.cpp, but I didn't have the hardware to test.

nktice commented 8 months ago

I can now report that, using the latest drivers, it seems to work now: I can load a model across GPUs and it's responsive.
[ ROCm 6.0 , torch==2.3.0.dev20240118+rocm6.0 ]