huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

GPTQ Formats that work (and don't) #601

Closed · ssmi153 closed 5 months ago

ssmi153 commented 1 year ago

Now that we can load GPTQ files that haven't been quantized by TGI's quantization script, I thought I'd do a set of tests to see which formats work and which don't. I'm using https://huggingface.co/TheBloke/OpenOrca-Preview1-13B-GPTQ as an example set.

1) The 'most compatible' format ([main] branch) doesn't work. This throws the following error: RuntimeError: weight model.layers.0.self_attn.q_proj.g_idx does not exist

2) Fortunately, the other formats provided by TheBloke do seem to work. In particular, gptq-4bit-128g-actorder_True definitely loads correctly. To use this, you need to set the following environment variables: GPTQ_BITS = 4 and GPTQ_GROUPSIZE = 128 (matching the group size of the quantized model). Additionally, you need to pass in REVISION = gptq-4bit-128g-actorder_True to pull the correct version of this model (rather than the default version, which still doesn't work).

So overall this is great news - we can now load GPTQ files that other people have converted rather than relying on the inbuilt quantizer in TGI!

TheBloke commented 1 year ago

That's great to hear. I started adding those extra quant formats recently with software like TGI and ExLlama in mind.

To the developers of the TGI GPTQ code I'd like to ask: is there any chance you could add support for the quantize_config.json file? It's produced automatically by AutoGPTQ when making a quantisation, and I provide it with every one of my GPTQ files, even the ones made with GPTQ-for-LLaMa. It contains all the GPTQ parameters, and it could easily be used as a source for the GPTQ params, saving the user the need to set them manually via env vars.

Here's an example quantize_config.json:

{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null
}

Pretty self explanatory. You'd just need to read bits and group_size from this file, found in the model folder, and it could work automatically.
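
A minimal sketch of how that could look on the loader side (assuming the file sits next to the weights; read_gptq_params and its fallback defaults are only illustrative, not TGI's actual API):

```python
import json
from pathlib import Path

def read_gptq_params(model_dir: str, default_bits: int = 4, default_groupsize: int = 128):
    """Read bits/group_size from quantize_config.json if present,
    otherwise fall back to the given defaults (e.g. taken from env vars)."""
    config_path = Path(model_dir) / "quantize_config.json"
    if not config_path.is_file():
        return default_bits, default_groupsize
    with config_path.open() as f:
        cfg = json.load(f)
    # Only these two fields are needed to load the weights for inference.
    return cfg["bits"], cfg["group_size"]
```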

I'd be happy to PR a change myself, if someone confirms it would be merged.

TheBloke commented 1 year ago

PS. Now that I have confirmation that TGI works with the other formats, I will mention this in my READMEs.

And yes, maybe calling main the 'most compatible' branch is no longer correct in light of TGI. I called it that because using the GPTQ-for-LLaMa CUDA branch - which is what I use to make the GPTQ in main - used to ensure the GPTQ would work with every local UI (text-generation-webui, KoboldAI, etc), including when partially offloaded to CPU. I briefly tried moving to AutoGPTQ for all quants a few weeks back, and got complaints from some users that they then couldn't CPU-offload the results. Hence I stuck with GPTQ-for-LLaMa and regarded it as 'most compatible'.

Maybe I'll call main the 'old' format.

Either way, I will clarify in the README which ones work with TGI.

kaleko commented 1 year ago

Similar issue for me, trying to get Vicuna 7B GPTQ models to run with TGI: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

So far haven't gotten it to work. Looking forward to the README updates with more info about using these models with TGI @TheBloke !

TheBloke commented 1 year ago

@kaleko I updated the Vicuna 7B v1.3 GPTQs yesterday. Download one of the models from the other branches, as listed under Provided Files. They will work if you manually set the GPTQ_BITS and GPTQ_GROUPSIZE as ssmi153 mentioned in the first post.

You can download from alternate branches using the REVISION parameter in TGI.

https://huggingface.co/TheBloke/vicuna-7B-v1.3-GPTQ


Narsil commented 1 year ago

is there any chance you could add support for the quantize_config.json file?

This is actually much cleaner than the ENV variables I added. I'm more than happy to switch to it. Is it correct to assume that for inference we can discard all config other than bits and groupsize? I just want to avoid loading/running a model and outputting garbage when we could raise an error early.

I'm actually fine with aligning on and outputting such a config instead of putting the values in the weights. Wdyt @OlivierDehaene? (If we can reduce the current split around GPTQ, it's all for the best)

We also have some work to use a better GPTQ kernel: https://github.com/huggingface/text-generation-inference/pull/553 if that's interesting.

The reason for the missing g_idx is the use of AutoGPTQ, correct?

ssmi153 commented 1 year ago

The AutoGPTQ quants from TheBloke seem to work actually. It's the ones that TheBloke is converting using GPTQ-for-Llama that are causing the missing g_idx error. Weirdly though, the GPTQ model file that I've created using GPTQ-for-Llama loads correctly, so there must be a difference somewhere in the way TheBloke is processing those files, or in the settings he's chosen for them.

@TheBloke, out of interest, how do your GPTQ-for-LLaMa conversion settings compare to this?

%run -i 'GPTQ-for-LLaMa/llama.py' {INPUT_MODEL_FULL_NAME} wikitext2 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors {OUTPUT_MODEL_FULL_NAME_SAFETENSORS}

My conversion of a 33B Llama model with these settings works with the current TGI implementation. Maybe we can identify any different settings on your side, which might allow us to isolate where the problem is.

TheBloke commented 1 year ago

My settings are effectively identical. The reason that my GPTQ-for-LLaMA quants aren't working is that they're using the "old" GPTQ format. There's never been any official GPTQ version naming, but some implementations call it "v1", versus the current "v2" format which has g_idx.

AutoGPTQ produces the new format and can load either format, but TGI uses an implementation that can only load the newer v2 format.
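
For anyone who wants to check which format a given checkpoint is in, a quick heuristic (a sketch assuming a safetensors file; this is not part of any of the tools discussed here) is to look for g_idx tensors:

```python
from safetensors import safe_open

def looks_like_v2_gptq(checkpoint_path: str) -> bool:
    """Heuristic: the newer ('v2') GPTQ format stores a g_idx tensor
    alongside qweight/qzeros for each quantized linear layer."""
    with safe_open(checkpoint_path, framework="pt") as f:
        keys = list(f.keys())
    has_qweight = any(k.endswith(".qweight") for k in keys)
    has_g_idx = any(k.endswith(".g_idx") for k in keys)
    return has_qweight and has_g_idx
```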

Going forward I will always be making multiple GPTQs available, nearly all of which will be made with AutoGPTQ and will work. For now I am still also providing an old GPTQ-for-LLaMa-produced version, which doesn't work. I haven't decided yet if I will continue making that old-format version as well - I need to survey the current state of all the various UIs out there and confirm they can all load the v2 format.

I wanted to switch to making all quants with AutoGPTQ a while back, but as soon as I did I got complaints from users of KoboldAI that the newer-format models didn't work with CPU offload. So to avoid hassle (and because I lacked the time to test it on KoboldAI myself), I just went back to GPTQ-for-LLaMa for making Llama GPTQs.

That issue may even be fixed now, so I should re-evaluate that and hope to do so soon.

Over the last few days I've uploaded multiple options of AutoGPTQ-produced quants to 62 GPTQ repos. And all my non-Llama models (Falcon, MPT, Bloom, Starcoder) were already produced with AutoGPTQ, so should already be TGI compatible. Most of the repos I haven't done are ones I don't plan to do, because they're old and superseded. Eg I didn't do Vicuna v1.1, just v1.3.

My next step is to update my SuperHOT GPTQ repos also - that should happen this weekend.

TheBloke commented 1 year ago

is there any chance you could add support for the quantize_config.json file?

This is actually much cleaner than the ENV variables I added. I'm more than happy to switch to it. Is it correct to assume that for inference we can discard all config other than bits and groupsize? I just want to avoid loading/running a model and outputting garbage when we could raise an error early.

Fantastic, thanks!

The other parameter that sometimes matters is desc_act, which in GPTQ-for-LLaMa is called Act Order. With AutoGPTQ it needs to know whether that's true or false else inference will produce gibberish.

But as you don't have an ENV var for it, I assume your GPTQ code is able to auto-detect that somehow, so presumably it isn't required. It isn't required in ExLlama either, so that's further confirmation that it's possible to auto-detect it.
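
(For reference, one way the auto-detection could work - just a sketch of the idea, not what TGI or ExLlama actually ship: without act-order, the stored g_idx is simply i // group_size, so any deviation from that trivial mapping implies desc_act=True.)

```python
import torch

def infer_desc_act(g_idx: torch.Tensor, group_size: int) -> bool:
    """Guess whether a GPTQ layer was quantized with desc_act (act-order).

    Without act-order, g_idx[i] == i // group_size for every input channel;
    with act-order, the channels were reordered so g_idx deviates from that."""
    trivial = torch.arange(g_idx.numel(), device=g_idx.device) // group_size
    return not torch.equal(g_idx.long(), trivial.long())
```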

And yes the other params in the file can be ignored, they're irrelevant during inference.

I'm actually fine with aligning on and outputting such a config instead of putting the values in the weights. Wdyt @OlivierDehaene? (If we can reduce the current split around GPTQ, it's all for the best)

We also have some work to use a better GPTQ kernel: #553 if that's interesting.

The reason for the missing g_idx is the use of AutoGPTQ, correct?

Sorry, I missed your earlier reply when I replied to ssmi's ping, else I'd have replied to you directly - but this is answered above: AutoGPTQ outputs g_idx and those files work fine; it's the old GPTQ-for-LLaMa CUDA version I have been using that doesn't output g_idx.

I would really love to stop using that old GPTQ-for-LLaMa code and will do so as soon as I've confirmed there's no longer any need for it.

But either way, I'll always have AutoGPTQ-produced GPTQs in future which it's confirmed TGI can load OK.

We also have some work to use a better GPTQ kernel: #553 if that's interesting.

Excellent! ExLlama's kernels are really amazing for performance and VRAM usage.

TheBloke commented 1 year ago

I've just tried to use TGI with GPTQs for myself for the first time, using the Docker container on a Lambda Labs H100 system.

Running unquantised models works fine, so TGI itself seems to be OK.

But so far I can't load any GPTQ models, because the server keeps crashing. There are no logs to help me debug this - is there any way I can get logs shown when using the Docker image?

Here's the full output of my attempting to run TheBloke/openchat_v2_openorca_preview-GPTQ, which @ssmi153 said was working for them. I tried the gptq-4bit-128g-actorder_True branch.

ᐅ docker run --rm --name tgi --shm-size=1gb -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 --runtime=nvidia --gpus all -p 8080:8080/tcp -v /workspace/data:/data ghcr.io/huggingface/text-generation-inference:0.9.2 --model-id TheBloke/OpenOrca-Preview1-13B-GPTQ --revision gptq-4bit-128g-actorder_True --hostname 0.0.0.0 --port 8080 --max-concurrent-requests 20  --quantize gptq
2023-07-17T13:39:01.039097Z  INFO text_generation_launcher: Args { model_id: "TheBloke/OpenOrca-Preview1-13B-GPTQ", revision: Some("gptq-4bit-128g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 20, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T13:39:01.039343Z  INFO text_generation_launcher: Starting download process.
2023-07-17T13:39:03.386135Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-17T13:39:03.844196Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-17T13:39:03.844670Z  INFO text_generation_launcher: Starting shard 0
2023-07-17T13:39:07.059936Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-17T13:39:13.066283Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-07-17T13:39:13.163861Z  INFO text_generation_launcher: Shard 0 ready in 9.317614619s
2023-07-17T13:39:13.252940Z  INFO text_generation_launcher: Starting Webserver
2023-07-17T13:39:13.527964Z  WARN text_generation_router: router/src/main.rs:165: Could not find a fast tokenizer implementation for TheBloke/OpenOrca-Preview1-13B-GPTQ
2023-07-17T13:39:13.528017Z  WARN text_generation_router: router/src/main.rs:168: Rust input length validation and truncation is disabled
2023-07-17T13:39:13.768428Z  INFO text_generation_router: router/src/main.rs:346: Serving revision 0b78faa0d35ea4386acafbf12dbbd6c014df25c0 of model TheBloke/OpenOrca-Preview1-13B-GPTQ
2023-07-17T13:39:13.778367Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T13:39:14.960712Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-17T13:39:15.060420Z ERROR text_generation_launcher: Webserver Crashed
2023-07-17T13:39:15.060477Z  INFO text_generation_launcher: Shutting down shards
2023-07-17T13:39:15.068091Z ERROR text_generation_launcher: Shard process was signaled to shutdown with signal 6
Error: WebserverFailed

I can see with nvtop that it does start loading the model, then it crashes with no further info.

Here's a log of me loading unquantised, just to show that works fine:

Log of loading an unquantised model (lmsys/vicuna-33b-v1.3) without problems:

```
ᐅ docker run --rm --name tgi --shm-size=1gb --runtime=nvidia --gpus all -p 8080:8080/tcp -v /workspace/data:/data ghcr.io/huggingface/text-generation-inference:0.9.2 --model-id lmsys/vicuna-33b-v1.3 --hostname 0.0.0.0 --port 8080 --max-concurrent-requests 20 --max-batch-total-tokens 8192
2023-07-17T13:41:07.153247Z  INFO text_generation_launcher: Args { model_id: "lmsys/vicuna-33b-v1.3", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 20, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 8192, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T13:41:07.153553Z  INFO text_generation_launcher: Starting download process.
2023-07-17T13:41:09.634012Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-17T13:41:10.059871Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-17T13:41:10.060361Z  INFO text_generation_launcher: Starting shard 0
2023-07-17T13:41:13.323937Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels. rank=0
2023-07-17T13:41:20.082090Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-07-17T13:41:27.101627Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0 rank=0
2023-07-17T13:41:27.195299Z  INFO text_generation_launcher: Shard 0 ready in 17.133052272s
2023-07-17T13:41:27.285728Z  INFO text_generation_launcher: Starting Webserver
2023-07-17T13:41:27.601753Z  WARN text_generation_router: router/src/main.rs:165: Could not find a fast tokenizer implementation for lmsys/vicuna-33b-v1.3
2023-07-17T13:41:27.601809Z  WARN text_generation_router: router/src/main.rs:168: Rust input length validation and truncation is disabled
2023-07-17T13:41:27.601825Z  WARN text_generation_router: router/src/main.rs:324: `--revision` is not set
2023-07-17T13:41:27.601833Z  WARN text_generation_router: router/src/main.rs:325: We strongly advise to set it to a known supported commit.
2023-07-17T13:41:27.858779Z  INFO text_generation_router: router/src/main.rs:346: Serving revision 7d7373f8b7c3ad92f7377562ad6a56938786faef of model lmsys/vicuna-33b-v1.3
2023-07-17T13:41:27.871117Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T13:41:30.834414Z  INFO text_generation_router: router/src/main.rs:221: Connected
2023-07-17T13:41:48.649128Z  INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.68.0 otel.kind=server trace_id=f52482293b2f3e1b4cf6122aaeb11ab4}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 17, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="740.908633ms" validation_time="137.399µs" queue_time="124.915µs" inference_time="740.647103ms" time_per_token="43.567476ms" seed="None"}: text_generation_router::server: router/src/server.rs:289: Success
```

Thanks in advance

Narsil commented 1 year ago

I would really love to stop using that old GPTQ-for-LLaMa code and will do as soon as I've confirmed there's no need to do so any more.

You mean using https://github.com/PanQiWei/AutoGPTQ instead of https://github.com/qwopqwop200/GPTQ-for-LLaMa, correct? I specifically used https://github.com/qwopqwop200/GPTQ-for-LLaMa because I found the code easier to reason about.

Notably, we don't need the modeling code at all from either lib. I was able to refactor the code to load nothing on the CPU by default and pull the weights layer by layer (onto CUDA, but it could be CPU) for quantization. This makes EVERY model from transformers work too (afaik at least - we're really just using AutoModelForCausalLM.from_pretrained). It makes everything a bit easier to work with. We currently don't provide a way to select the data sent during quantization, but I'm not sure how much that really matters (it didn't seem to matter much for Llama-derived models).
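
(A rough sketch of that lazy, per-tensor loading idea using safetensors - iter_tensors and quantize_layer are hypothetical names for illustration, not TGI's actual code:)

```python
import torch
from safetensors import safe_open

def iter_tensors(checkpoint_path: str, device: str = "cuda:0"):
    """Yield (name, tensor) pairs one at a time, materialising each tensor
    directly on the target device instead of loading the whole model on CPU."""
    with safe_open(checkpoint_path, framework="pt", device=device) as f:
        for name in f.keys():
            yield name, f.get_tensor(name)

# A quantization loop can then process and free one layer at a time:
# for name, weight in iter_tensors("model.safetensors"):
#     quantize_layer(name, weight)      # hypothetical GPTQ step
#     del weight
#     torch.cuda.empty_cache()
```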

All that to say that I'm hesitant to pull from either codebase, because they are quite large and we only use a very tiny fraction of the code. Trying to find common ground would be nice.

OlivierDehaene commented 1 year ago

@TheBloke, TGI seems to have issues with H100s; I'm not sure why yet. Any chance you could test on another device? I was able to launch the model on 1x A10, for example.

You can also use ghcr.io/huggingface/text-generation-inference:sha-44acf72 with the env var LOG_LEVEL=info,text_generation_launcher=debug for more logs.

TheBloke commented 1 year ago

@Narsil Sorry, I didn't mean you should stop using GPTQ-for-LLaMa. I was just talking about my own quantising processes. I meant I would like to stop using the old CUDA fork of GPTQ-for-LLaMa for making new quants and uploading them to HF.

For a long time I used an old CUDA fork for my 'main' branch GPTQs because there were UIs out there that couldn't support the g_idx GPTQ format in all scenarios. But that might not be the case any more, and I plan to check.

If you don't even use the modelling code then it sounds like you have a very lightweight implementation so that's great.

@OlivierDehaene Thank you. And I just noticed there's already an issue posted for this, so I'll move there.

TheBloke commented 1 year ago

I just realised the H100 issue is already reported here: https://github.com/huggingface/text-generation-inference/issues/613 . I'll post there and stop de-railing this thread as it's obviously H100 specific.

(For completeness on this thread: it looks like it's because the PyTorch 2.0 in the Docker image doesn't support compute capability 9.0. If I build a new container with a PyTorch 2.1 nightly, I think it will work.)

ssmi153 commented 1 year ago

@TheBloke - I just did some benchmarking of TGI on Runpod instances using a range of GPU combinations (https://docs.google.com/spreadsheets/d/1Ph_GeybAtNVoTs7w4mkCfd7p1lGywsNhJf9z-8fTcUE/edit?usp=sharing, if you're interested). This benchmarking was designed to reflect how I would use TGI and may not be as robust as some of the more formal benchmarks. One thing to note when benchmarking with the Docker image is that I get (and it looks like you also got) a warning saying WARN shard-manager: text_generation_launcher: We're not using custom kernels. I think this is because, by default, the combination of Runpod + the Docker image doesn't have NVIDIA NVCC installed (it requires the developer version of the NVIDIA container toolkit rather than the standard one), so it can't build the custom kernels or vLLM for PagedAttention. I'm struggling to work out whether this warning is a red herring, as TGI is still impressively fast. If it is indeed correct, then there should be a further uplift in performance above and beyond what I've seen so far. I haven't yet created an issue about the warning because it felt like I was raising a million Runpod-related issues, but if other people are also seeing this (including on other platforms using the Docker image) then it might be worth exploring further.

OlivierDehaene commented 1 year ago

@ssmi153, this warning is a bit dismissive. If you don't see import errors and your architecture is one of the optimized architectures (as displayed in the README), you are using flash and paged attention. This warning only applies to BLOOM and non-flash NeoX.

Ichigo3766 commented 1 year ago

@ssmi153 Can you confirm the new Llama 2 GPTQ versions are working? I am getting an error with TheBloke's AutoGPTQ branch: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

WizardCoder is also giving an error when trying to load the GPTQ version of it:

raise RuntimeError(f"weight {tensor_name} does not exist") RuntimeError: weight gptq_bits does not exist

I am running the docker command with the variables you mentioned at the start, and there are models that do work, so I'm wondering why this one throws an error when it works fine in fp16.

Ichigo3766 commented 1 year ago

The fix seems to work for Llama 70B but is pretty slow. Thank you!

I was wondering if you've had time to check WizardCoder. I'm still having issues loading the GPTQ version:

raise RuntimeError(f"weight {tensor_name} does not exist") RuntimeError: weight gptq_bits does not exist

fxmarty commented 1 year ago

@bloodsucker99 You need to pass the environment variable GPTQ_BITS (though I think gptq_bits and gptq_groupsize could be directly inferred from the shapes of qweights, qzeros, g_idx?)
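
For what it's worth, a sketch of that shape-based inference (assuming the usual GPTQ packing, where qweight packs 32 // bits input channels per int32 row and qzeros has one row per group; this is not TGI's actual code):

```python
import torch

def infer_gptq_params(qweight: torch.Tensor, qzeros: torch.Tensor, g_idx: torch.Tensor):
    """Infer (bits, group_size) from packed GPTQ tensors.

    Assumed layout:
      qweight: [in_features // (32 // bits), out_features]  (int32-packed)
      qzeros:  [in_features // group_size, out_features // (32 // bits)]
      g_idx:   [in_features]"""
    in_features = g_idx.shape[0]
    pack_factor = in_features // qweight.shape[0]   # == 32 // bits
    bits = 32 // pack_factor
    group_size = in_features // qzeros.shape[0]
    return bits, group_size
```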

TheBloke commented 1 year ago

Narsil was also looking into adding automatic support for quantize_config.json which would provide the bits and group_size without the user needing to specify any env vars.

But yeah, I think ExLlama can automatically detect the quantise config, so maybe neither env vars nor quantize_config.json are needed and it can just be auto-detected? That would be the ideal scenario.

keelezibel commented 1 year ago

@TheBloke I am trying to run your quantized model TheBloke/Llama-2-70B-chat-GPTQ on a V100, but TGI is complaining about compute capability < 7.5 detected. Is it that flash attention is not supported on the V100 GPU? Does that mean all GPTQ models can only run on Ampere or later cards as well?

Narsil commented 1 year ago

Indeed flash isn't supported on V100, and sharding requires flash for llama.

keelezibel commented 1 year ago

Indeed flash isn't supported on V100, and sharding requires flash for llama.

I tried setting the env var for sharding to false, but it doesn't work either. Does that mean I definitely have to load the full model on the V100? I remember I tried bitsandbytes and it worked, but it was relatively slow since there is offloading to RAM.

keelezibel commented 1 year ago

@Narsil, don't mind me asking for clarification: this PR won't help with running GPTQ models on older models? It's just for reading the GPTQ config from the model folder?

Narsil commented 1 year ago

older models?

What do you mean? All of TheBloke's models have this quantization configuration, no?

keelezibel commented 1 year ago

older models?

What do you mean? All of TheBloke's models have this quantization configuration, no?

Sorry, I meant this PR won't help with running GPTQ models on older GPU cards such as the V100, right?

Ichigo3766 commented 1 year ago

So the problem is in the flash santacoder modeling code: it's trying to use the bits values from the weights and not using the environment variables. Manually forcing the use of the env variables fixed the issue :)

taoari commented 1 year ago

@TheBloke When I tried to run with TheBloke/Llama-2-7b-Chat-GPTQ, I got the following error:

warmup{max_input_length=4096 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens"))

Actually, I am able to serve the official meta-llama/Llama-2-7b-chat-hf with --quantize bitsandbytes on a single T4 GPU. When I change the model to TheBloke/Llama-2-7b-Chat-GPTQ with --quantize gptq, I get the not-enough-memory error. Even when I changed --max-batch-prefill-tokens=2048, this error still happens. Since the bitsandbytes-quantized version can be served on a single T4 GPU, the GPTQ-quantized version should be no problem - do you happen to know why?

AIApprentice101 commented 1 year ago

I have the same error when trying to load TheBloke/Llama-2-7b-Chat-GPTQ

samos123 commented 1 year ago

I'm hitting the same issue as @taoari when trying Llama 2 70B chat. The error message:

"level":"ERROR","message":"Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"name":"warmup"},"spans":[{"max_input_length":1024,"max_prefill_tokens":4096,"name":"warmup"},{"name":"warmup"}]}

This is my YAML manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: ghcr.io/huggingface/text-generation-inference:1.0.3
        resources:
          limits:
            nvidia.com/gpu: 2
        env:
        - name: MODEL_ID
          value: TheBloke/Llama-2-70B-chat-GPTQ
        - name: NUM_SHARD
          value: "2"
        - name: QUANTIZE
          value: gptq
        - name: GPTQ_BITS
          value: "4"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.