turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Lazy loading of 2 models gives CUDA out of memory #521

Closed waterangel91 closed 14 hours ago

waterangel91 commented 3 days ago

I wonder if there is an efficient way to load 2 different models. I am using the new async generator, and my code can generate responses from 2 different models at the same time just fine.

The problem is that when using lazy loading, the 2nd model will run out of memory; the loader does not automatically account for the memory already utilized by the first model.

The line of code that shows the error is the cache creation:

cache = ExLlamaV2Cache_Q4(self.model, max_seq_len=self.max_tokens, lazy=not self.model.loaded)

So I need to manually allocate GPU memory among the cards, and it seems to be a delicate act; it also seems like I must leave enough room on GPU:0? (I could be wrong on this.)

Is there a good way to go about allocating GPU memory for 2 models?
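
For reference, this is roughly how I do the manual split at the moment (just a sketch; the model directory, the per-GPU GB numbers and the sequence length are placeholders):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"     # placeholder model directory
config.prepare()

model = ExLlamaV2(config)

# Manual split: approximate GB of weights to place on each GPU.
# Leaving enough headroom on GPU:0 by hand is the delicate part.
model.load(gpu_split = [14.0, 20.0])

cache = ExLlamaV2Cache_Q4(model, max_seq_len = 8192)    # allocated after loading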

Also, another question: when I generate responses from 2 models at the same time, the combined speed is a bit strange.

Separate generation: Mistral 7B 150 tok/s, Yi-34B 32 tok/s
Generating at the same time: Mistral 7B 32 tok/s, Yi-34B 25 tok/s

What surprised me is how much the generation speed for Mistral dropped compared to Yi-34B, so the reduction in speed when generating together is not linear?

If I load 2 separate endpoints, each one running one model on its own port, then the speed is retained. What could I be doing wrong that makes the speed drop when I combine the models into 1 endpoint? Could it be that the cache is shared?

turboderp commented 3 days ago

It would depend on whether you're using both models at the same time (i.e. in concurrent threads) or interleaving calls to two separate generators. The autosplit loader works by performing a forward pass of max_input_len tokens as the layers are loaded, until an OoM exception tells it the first GPU is maxed out; then it starts loading layers to the next GPU and so on.

After loading one model like this, it's "guaranteed" (more or less) that the VRAM remaining on each GPU is enough for that model. But it won't be enough if you start filling that free space with weights or buffers for a second model. It should still be safe to load a small model (like a draft model), followed by a large model, as long as they're not used concurrently.

Generally speaking, predicting VRAM usage exactly is a very hard problem due to the libraries used. Both CUDA/ROCm and Torch have their own internal memory allocation/caching schemes, then there's whatever is used by flash-attn or xformers, which could all change from version to version, or change with new driver versions or different hardware.

One thing you could do is try increasing the reserve_vram argument to load_autosplit. The type hint is wrong, I just noticed; it's supposed to be a list[int] of how many bytes to reserve on each GPU, in addition to what's needed for the reference forward pass. That's still somewhat manual, of course, and not ideal.
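
Roughly like this (just a sketch; the byte amounts and sequence length are arbitrary examples, and model is whatever you already have, loaded lazily):

from exllamav2 import ExLlamaV2Cache_Q4

# Lazy cache, so it gets allocated during the autosplit pass, as in your snippet.
cache = ExLlamaV2Cache_Q4(model, max_seq_len = 8192, lazy = True)

# Hold back ~1 GB on GPU 0 and ~512 MB on GPU 1, on top of what the reference
# forward pass needs, so there's room left over for a second model.
model.load_autosplit(cache, reserve_vram = [1024 ** 3, 512 * 1024 ** 2])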

I'm not sure if there's a better approach to automatically splitting one or more models across multiple GPUs, but I'd happily take suggestions. I don't think it's too different from trying to use every last byte of system RAM which is also inherently problematic outside of embedded systems.

As for the performance dropping off nonlinearly, that's not too strange, I think. If you imagine running the two generators in the same thread, that would look something like:

results = generator1.iterate()  # latency: 7 ms
results += generator2.iterate()  # latency: 31 ms
process(results)

This would have a total latency of 38 ms and reduce the speed of both generators to 26 t/s. You seem to be running the models in separate threads, and then it becomes a question of how reliably you can schedule about 5 forward passes from one thread over the course of 1 forward pass from the other thread.

The Python GIL would be the first obstacle there, I guess. Another concern is that CUDA is pretty bad at context switching. It loves powering through a long queue of kernels or a graph, but it will stutter if the flow gets interrupted at any point, like if two threads are competing for resources. And I can easily picture how, even if you're multithreaded on the CPU side, your threads just end up taking turns queuing up entire forward passes and then running more or less sequentially anyway, since there are synchronization points at the start and end of each forward pass.

If you haven't already, you'll at least want to make sure each thread is using its own CUDA stream. PyTorch has a per-thread default stream option that you can enable with an environment variable:

export PYTORCH_CUDA_ALLOC_CONF=per_thread_default_stream

But it's not something I've done a lot of testing on. Worth a shot, I guess.
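
If that doesn't do anything, an explicit (untested) sketch would be to create a stream per thread yourself, with generator1/generator2 being the two dynamic generators and handle() standing in for whatever you do with the results:

import threading
import torch

def run_model(generator, device_index):
    # Dedicated stream for this thread, so its kernels queue up independently
    # of the other model's forward passes.
    stream = torch.cuda.Stream(device_index)
    with torch.cuda.stream(stream):
        while True:
            results = generator.iterate()   # same iterate() loop as before
            if not results: break           # stop once nothing is pending
            handle(results)                 # placeholder

t1 = threading.Thread(target = run_model, args = (generator1, 0))
t2 = threading.Thread(target = run_model, args = (generator2, 1))
t1.start(); t2.start()
t1.join(); t2.join()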

waterangel91 commented 3 days ago

Thanks for your answer. I am learning a lot from you. I will try out your suggestions on the model loading part.

On the generation part, if I understand you correctly, the reason the speed is much faster when I spin up 2 separate FastAPI servers, one for each model, is that they run in separate threads instead of 1 single thread when I combine them into one FastAPI app, because Python is single-threaded in nature?

In the case of LLMs, I thought things are mostly constrained by the GPU, so if I load the models on different GPUs, running each model in a different thread would probably be OK? (I will test your suggestion on PYTORCH_CUDA_ALLOC_CONF as well.)

I am running a 14900K with 16 cores / 32 threads, if this info matters. I am really new to Python threading and glad to learn more from you on this.

waterangel91 commented 2 days ago

So today I loaded a 3rd model, and the error came back. I have enough GPU memory, as I loaded this 3rd model on a card that was totally unused so far.

Here is the exact error log, in case you are able to tell something from it.

  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/exllamav2/generator/dynamic.py", line 924, in iterate
    self.iterate_gen(results)
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/exllamav2/generator/dynamic.py", line 1091, in iterate_gen
    torch.cuda.synchronize()
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 800, in synchronize
    with torch.cuda.device(device):
  File "/home/remichu/miniconda3/envs/mlenv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 374, in __enter__
    self.prev_idx = torch.cuda._exchange_device(self.idx)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Update: I tried setting the option below, but it seems like it doesn't help either.

export PYTORCH_CUDA_ALLOC_CONF=per_thread_default_stream

waterangel91 commented 14 hours ago

So in the end, I couldn't resolve the issue with loading the models. I loaded 3 different models on 3 different cards; each model fit nicely on its card and I manually set the GPU split. But somehow it still ended up with the error.

I worked around it by building a Docker image with flash attention and exllamav2. When I run the container, I specify the GPUs with --gpus '"device=1"', and that resolved the issue.

Though I hope that in the future exllamav2 will support multi-model loading in a more convenient way, especially since there are more and more modalities coming. Loading multiple models, e.g. vision, TTS, and LLMs, at the same time will be a valid use case. Cheers