Using Exllama backend requires all the modules to be on GPU - how?

turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

MIT License

Sorry, I could not find any relevant documentation online on how to load all modules onto the GPU.

I got this error message from my code:

Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object

A snippet from my code (to make it run, I have to uncomment the config lines, but then it no longer uses Exllama):

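    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "TheBloke/Llama-2-13b-Chat-GPTQ"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Workaround: uncommenting these two lines (and config=config below)
    # avoids the error, but disables the Exllama backend.
    # config = AutoConfig.from_pretrained(MODEL_ID)
    # config.quantization_config["disable_exllama"] = True

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        # config=config,
    )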

Any help is greatly appreciated!
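
One direction that may be worth trying, offered as a sketch rather than a confirmed fix: the error message suggests some modules stayed on CPU or disk, and `from_pretrained` accepts a `device_map` argument that controls placement. The helper names below (`gpu_only_load_kwargs`, `load_on_gpu`, `offloaded_modules`) are hypothetical, and the sketch assumes a single CUDA GPU with enough memory for the whole model and that `accelerate` is installed.

```python
def gpu_only_load_kwargs(gpu_index: int = 0) -> dict:
    """Build from_pretrained kwargs that place the whole model on one GPU."""
    # The empty-string key refers to the root module, so every submodule
    # is assigned to the given GPU index instead of cpu/disk.
    return {"device_map": {"": gpu_index}}


def offloaded_modules(device_map: dict) -> list:
    """Return the names of modules that a device map places on cpu or disk."""
    return [name for name, dev in device_map.items() if dev in ("cpu", "disk")]


def load_on_gpu(model_id: str, gpu_index: int = 0):
    """Load tokenizer and model with all modules pinned to one GPU (assumption)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, **gpu_only_load_kwargs(gpu_index)
    )
    return tokenizer, model
```

Calling `load_on_gpu("TheBloke/Llama-2-13b-Chat-GPTQ")` would be the intended usage. After loading with a device map, `offloaded_modules(model.hf_device_map)` can be used to check whether anything still landed on CPU or disk; an empty list would mean every module is on a GPU.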

Using Exllama backend requires all the modules to be on GPU - how? #306