qwopqwop200 / GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ
Apache License 2.0

Benchmark broken on H100 #231

Open · FrederikAbitz opened this issue 1 year ago

FrederikAbitz commented 1 year ago
```
(textgen) ubuntu@anon:~/text-generation-webui/repositories/GPTQ-for-LLaMa$ stdbuf --output=L python -u llama.py ~/text-generation-webui/models/llama-7b-hf c4 \
>     --wbits 4 \
>     --groupsize 128 \
>     --load ~/text-generation-webui/models/llama-7b-4bit-128g_true-seq_act-order.safetensors \
>     --benchmark 2048 \
>     --check 2>&1 \
> | tee llama-7b-4bit-128g_true-seq_act-order_bench.log
Loading model ...
/home/ubuntu/miniconda3/envs/textgen/lib/python3.11/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%|          | 0/12 [00:00<?, ?it/s]python: /opt/conda/conda-bld/torchtriton_1677881353797/work/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion `false && "computeCapability > 90 not supported"' failed.
```

Quantization itself works; only the benchmark is broken as of commit 05781593c818d4dc8adc2d32c975e83d17d2b9a8.
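
For context, the abort comes from the Triton build bundled with PyTorch (the `torchtriton` path in the traceback): its MMA-version lookup predates Hopper, and an H100 reports compute capability 9.0, i.e. 90, which trips the assert during the autotune warmup. A quick way to confirm an environment hits this case is sketched below; it is a minimal check that assumes a CUDA-enabled PyTorch and an importable `triton` package, and it only reads versions rather than patching anything:

```python
# Minimal diagnostic sketch: report the GPU's compute capability and the
# installed Triton version. Triton builds that predate sm_90 support abort
# with the "computeCapability > 90 not supported" assert on Hopper GPUs.
import torch
import triton

major, minor = torch.cuda.get_device_capability(0)
cc = major * 10 + minor  # H100 reports (9, 0) -> 90
print(f"GPU compute capability: sm_{cc}")
print(f"Triton version: {triton.__version__}")

if cc >= 90:
    print("Hopper-class GPU detected; expect the MMA-version assert "
          "with Triton builds that lack sm_90 support.")
```

If the check confirms sm_90 with an old bundled Triton, a newer Triton release with Hopper support should be needed before the Triton-based benchmark path can run on this card.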