Closed — josephrocca closed this issue 2 months ago
Likely the config was compiled for a higher TP degree (leveraging multiple GPUs), but you don't have that many GPUs.
Yeah, the model was compiled for 4x GPU. I'll upload ones for 2x GPU in a bit, or you can compile them yourself (it's not hard, just a bit inconvenient for the bigger models).
Ah, thank you both.
@bayley A 2x one would be great if it's not too much trouble! Or can I just change the `tensor_parallel_shards` value in the config file to 2? I just tried that, and also reduced all the 8192 values relating to context length (i.e. everything except `hidden_size`) down to 4096, but am running into:
```
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 20584.716 MB, which is less than the sum of model weight size (18643.766 MB) and temporary buffer size (2977.048 MB).
```
which seems a bit strange, because there should be ~24 GB of memory available in each of the two 4090 GPUs (i.e. significantly more than 20584.716 MB). So my guess is that simply changing the `tensor_parallel_shards` value in the config file is not a valid approach.
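For reference, the arithmetic behind the error can be sketched as a toy check (illustrative code, not MLC's internals; the numbers are taken verbatim from the TVMError above):

```python
# Hypothetical sketch mirroring the memory check MLC reports: per-GPU model
# weights plus the temporary buffer must fit in the available single-GPU memory.
# All numbers below come from the error message; the function name is ours.

def fits_on_gpu(weight_mb: float, buffer_mb: float, available_mb: float) -> bool:
    """True if weights + temporary buffers fit in available GPU memory."""
    return weight_mb + buffer_mb <= available_mb

weight_mb = 18643.766     # per-GPU weight shard size from the error
buffer_mb = 2977.048      # temporary buffer size from the error
available_mb = 20584.716  # what MLC sees, less than the 4090's nominal 24 GB

print(fits_on_gpu(weight_mb, buffer_mb, available_mb))      # False
print(f"shortfall: {weight_mb + buffer_mb - available_mb:.1f} MB")  # ~1 GB short
```

This makes the ~1 GB shortfall explicit: the weights were sharded for TP=4 at conversion time, so each of the two GPUs is being handed a shard sized for a different split.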
oh hmm, I just realized the model quants are independent of the TP rank outside of the config file, so something else is going on here. @tqchen is `--overrides "tensor_parallel_shards=2"` the correct flag to set the number of TP shards at compile time?
@bayley Yes, that's true. Sorry for the late response.
The chat config in the HF repo was generated with TP=4. Besides the TP override during compile, do any other edits need to be made to the config for TP=2?
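Based on the flag confirmed above, a recompile for TP=2 might look like the following (a sketch only: paths and the output name are placeholders, and exact flags can differ between mlc_llm versions):

```shell
# Recompile the model library for 2-way tensor parallelism.
# The config path and output path below are placeholders for your setup.
mlc_llm compile ./dist/model-q4f16_1-MLC/mlc-chat-config.json \
  --device cuda \
  --overrides "tensor_parallel_shards=2" \
  -o ./dist/libs/model-q4f16_1-cuda-tp2.so
```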
🐛 Bug
I found this repo on Hugging Face, kindly shared publicly by @bayley, who also provided the commands for serving. But upon using those commands, I get an error: `CUDA: invalid device ordinal`. I tried using `mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC` instead and it works fine. Could it be that each version of MLC requires a new MLC model conversion? If that is likely the cause, then:
To Reproduce
Environment
This Docker image:
- How you installed MLC-LLM (`conda`, source): See above.
- How you installed TVM-Unity (`pip`, source): See above.
- GPU driver version: see `nvidia-smi` output within the Docker container.
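One plausible reading of the `CUDA: invalid device ordinal` error above is that the runtime tried to address a GPU index beyond what the machine has, because the config was generated for TP=4. A toy check (illustrative names, not MLC internals):

```python
# Illustrative sketch: "invalid device ordinal" typically means the runtime
# requested a GPU index >= the number of visible devices, e.g. a config built
# with tensor_parallel_shards=4 running on a 2-GPU machine. Not MLC's code.

def validate_tp(tensor_parallel_shards: int, visible_gpus: int) -> None:
    """Raise if the config demands more GPUs than are visible."""
    if tensor_parallel_shards > visible_gpus:
        raise RuntimeError(
            f"CUDA: invalid device ordinal "
            f"(config wants {tensor_parallel_shards} GPUs, "
            f"only {visible_gpus} visible)"
        )

validate_tp(2, 2)  # TP matches the hardware: no error
try:
    validate_tp(4, 2)  # the mismatch suspected in this issue
except RuntimeError as e:
    print(e)
```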