qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ
Apache License 2.0

Sample code does not work #243

Open foamliu opened 1 year ago

foamliu commented 1 year ago

Thanks for the great work. Here are the errors from my side (one host with eight V100 GPUs):

CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|████████████████████| 12/12 [00:33<00:00, 2.80s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
  0%|                    | 0/12 [00:00<?, ?it/s]
python: /project/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted (core dumped)

kk3dmax commented 1 year ago

I have the same issue.

DalasNoin commented 1 year ago

Have you tried using the --no_fused_mlp option when running the command? If that solves the issue, we can add it to the README and close this one. I added the option because I had the same error. A sketch of the re-run is below.
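
For reference, a sketch of the suggested re-run, reusing the model path and checkpoint from the original report (adjust to your own setup):

# original invocation with the --no_fused_mlp workaround flag appended
CUDA_VISIBLE_DEVICES=0 python llama_inference.py /home/xxx/models/hf_converted_llama/7B/ --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --no_fused_mlp

Judging by the log above, the assertion fires during the fused-MLP autotune warmup, which this flag appears to skip.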