lyogavin / airllm

AirLLM 70B inference with single 4GB GPU

Is it possible to use AirLLM with a quantized input model? #117

Open Verdagon opened 6 months ago

Verdagon commented 6 months ago

Hi there! Thanks for this amazing library. I was able to run a 70B model on my M2 MacBook Pro!

I was able to get about one token every 100 seconds, which is almost good enough for my overnight tasks; I'm hoping I can get it down to 20 seconds per token, though.

Is it possible to quantize the input model to make it faster?

I've tried quantizing with llama.cpp, but I think its output format isn't one AirLLM can load. I see that PyTorch has a way to quantize, but I can't figure out how to do it with AutoModel.
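For what it's worth, the PyTorch route I was looking at is torch.quantization.quantize_dynamic. A minimal sketch of what I tried (the model id is a small placeholder; a real 70B checkpoint wouldn't fit in my RAM, which rather defeats the purpose):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model id; a real 70B model won't fit in RAM on my machine.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Dynamic quantization swaps nn.Linear modules for int8-weight versions.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The catch: this saves quantized module state, not the plain tensor
# layout that AirLLM's layer splitter seems to expect.
torch.save(quantized.state_dict(), "quantized.pt")
```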

Any pointers in the right direction would help. Thanks!

Verdagon commented 6 months ago

I just re-read the README again and learned about the compression option!
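For anyone else who finds this, the invocation from the README is roughly the following (the model id is just an example; per the README, compression takes '4bit' or '8bit' and relies on bitsandbytes):

```python
from airllm import AutoModel

# compression='4bit' block-quantizes each layer shard via bitsandbytes.
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",  # example 70B model
    compression='4bit',
)
```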

However, it doesn't quite work; I get this error:

Traceback (most recent call last):
  File "/Users/verdagon/AirLLM/air_llm/main.py", line 12, in <module>
    model = AutoModel.from_pretrained(
  File "/Users/verdagon/AirLLM/air_llm/airllm/auto_model.py", line 49, in from_pretrained
    return AirLLMLlamaMlx(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/verdagon/AirLLM/air_llm/airllm/airllm_llama_mlx.py", line 224, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 351, in find_or_create_local_splitted_path
    return Path(model_local_path_or_repo_id), split_and_save_layers(model_local_path_or_repo_id, layer_shards_saving_path,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 303, in split_and_save_layers
    layer_state_dict = compress_layer_state_dict(layer_state_dict, compression)
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 169, in compress_layer_state_dict
    v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
  File "/Users/verdagon/Library/Python/3.9/lib/python/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I tried changing that v.cuda() to v.cpu(), but it didn't help; instead I get an error deeper in bitsandbytes.
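Concretely, this is the one-line change I tried in compress_layer_state_dict (utils.py, from the traceback above):

```python
# Original line:
#   v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
# My attempt:
v_quant, quant_state = bnb.functional.quantize_blockwise(v.cpu(), blocksize=2048)
# ...which still errors inside bitsandbytes on my Mac.
```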

Reading the bitsandbytes docs, it sounds like bitsandbytes is a CUDA library, so I'm guessing this compression feature is only meant for CUDA machines. They're working on Mac support, but it isn't done yet. Unfortunate!

Hopefully there's a way to quantize the input instead.

Verdagon commented 4 months ago

Looking at the code more, it looks like AirLLM only supports the PyTorch and safetensors file formats. This might work if I can get a quantized model into one of those.
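A sketch of the route I have in mind, assuming some tool can produce a quantized model as a plain dict of torch tensors (whether AirLLM would actually accept quantized tensors is an open question):

```python
import torch
from safetensors.torch import save_file

# Hypothetical path to a quantized checkpoint saved as a torch state dict.
state_dict = torch.load("quantized_model.bin", map_location="cpu")

# safetensors requires contiguous, non-shared tensors.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")
```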

lyogavin commented 4 months ago

Will add.