johnsmith0031 / alpaca_lora_4bit

MIT License

Make this pip installable #82

Open winglian opened 1 year ago

winglian commented 1 year ago

This is a pretty big refactor to make this repo pip installable.

There is some other cleanup that probably needs to be done, but I figured I should see if you want to go down this path. Thanks!

johnsmith0031 commented 1 year ago

Thanks for doing this! I think I would also merge the CUDA kernel into this repo so that the external dependency on the GPTQ fork is no longer needed. I think it would also give better compatibility with mainline GPTQ.

winglian commented 1 year ago

The problem is that mainline GPTQ doesn't even keep the CUDA kernel around anymore; they've hitched their wagon to Triton.

delete kernel: https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/2d3256b69534b2d33c864d791868a4ff2b038aff
delete quant_cuda.cpp: https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/e43c50696e0ca54215f8f491e468fc4885a1d003

winglian commented 1 year ago

Alright, I've moved quant_cuda into this repo. Because of the way setuptools works, it's nearly impossible to make the CUDA extension an extra without splitting it into a separate external package, so it gets installed by default and Triton is optional.
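For reference, here's a minimal sketch of what that packaging approach can look like, assuming setuptools with torch's CUDAExtension; the module name, source paths, and extras below are illustrative assumptions, not necessarily what this repo ends up using:

```python
# setup.py -- illustrative sketch of the approach described above, not the repo's actual file.
from setuptools import setup, find_packages
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="alpaca_lora_4bit",
    packages=find_packages(),
    # The compiled kernel is built unconditionally: setuptools has no clean way
    # to gate a C++/CUDA extension behind an extras_require group inside the
    # same package, which is why it ends up installed by default.
    ext_modules=[
        CUDAExtension(
            name="quant_cuda",
            sources=[
                "quant_cuda/quant_cuda.cpp",        # assumed file layout
                "quant_cuda/quant_cuda_kernel.cu",  # assumed file layout
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # Triton is a pure pip dependency, so it can stay an optional extra.
    extras_require={"triton": ["triton"]},
)
```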

johnsmith0031 commented 1 year ago

Thank you for putting everything together! I made a PR to text-generation-webui; once it is merged I'll merge this PR into main. I think we should also adjust the Dockerfile for the pip-installable alpaca_lora_4bit, for compatibility.

winglian commented 1 year ago

I took a pass at updating the Dockerfile, but I don't have CUDA on my local machine so I can't validate that it's totally correct. If someone else has a chance to look at the Dockerfile and build/run it 🙏
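In case it helps anyone test, here's a rough sketch of the kind of Dockerfile being discussed; the base image tag, paths, and entry point are assumptions, not the repo's actual file:

```dockerfile
# Illustrative sketch only -- not the repo's actual Dockerfile.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends git python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /alpaca_lora_4bit
COPY . .

# torch must be present before the CUDA extension can build; use ".[triton]"
# instead of "." if the Triton backend is wanted.
RUN pip3 install --no-cache-dir torch && \
    pip3 install --no-cache-dir .

# Entry point is a placeholder; inference.py is mentioned later in this thread.
CMD ["python3", "inference.py"]
```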

johnsmith0031 commented 1 year ago

Thanks for everything done here! I think I'll temporarily keep this on the winglian-setup_pip branch for those who want the pip-installable version, and keep the old version on the main branch for compatibility with the monkeypatch code in the webui. I may merge them if something changes in the future.

myyk commented 1 year ago

Still seeing an error when trying to run from Docker. I don't understand what's going on well enough to fix this, but it isn't fixed by simply running pip install triton. It looks to me like the "quant_cuda not found." message comes from matmul_utils_4bit.py not finding the quant_cuda folder.

Well anyway, I got my machine back up so that I can help test this.


==========
== CUDA ==
==========

CUDA Version 11.7.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".
Triton not found. Please run "pip install triton".
WARNING:root:Neither gptq/cuda or triton backends are available.
Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/server.py", line 1, in <module>
    import custom_monkey_patch # apply monkey patch
  File "/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 7, in <module>
    from models import Linear4bitLt
  File "/alpaca_lora_4bit/text-generation-webui/models.py", line 6, in <module>
    from peft.tuners.lora import is_bnb_available, Linear, Linear8bitLt, LoraLayer
ImportError: cannot import name 'Linear8bitLt' from 'peft.tuners.lora' (/root/.local/lib/python3.10/site-packages/peft/tuners/lora.py)
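For what it's worth, the "quant_cuda not found" / "Triton not found" lines above are the kind of output produced by an import guard along these lines; this is a sketch of the general pattern, not the actual contents of matmul_utils_4bit.py:

```python
# Sketch of a backend-detection guard; the real matmul_utils_4bit.py may differ.
import logging

try:
    import quant_cuda  # compiled CUDA extension from the [cuda] extra
    has_cuda_kernel = True
except ImportError:
    print('quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".')
    has_cuda_kernel = False

try:
    import triton  # optional Triton backend
    has_triton = True
except ImportError:
    print('Triton not found. Please run "pip install triton".')
    has_triton = False

if not (has_cuda_kernel or has_triton):
    logging.warning("Neither gptq/cuda or triton backends are available.")
```

If that's roughly what's happening, the warnings just mean neither backend module is importable inside the container; the traceback about Linear8bitLt looks like a separate issue with the installed peft version.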
myyk commented 1 year ago

I think that last change improved it, but there's still something off. I upgraded CUDA to 11.8 because I don't think 11.7 works with my driver, and it's on its way out anyway.

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

quant_cuda not found. Please run "pip install alpaca_lora_4bit[cuda]".
Triton not found. Please run "pip install triton".
WARNING:root:Neither gptq/cuda or triton backends are available.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/root/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Traceback (most recent call last):
  File "/alpaca_lora_4bit/text-generation-webui/server.py", line 1, in <module>
    import custom_monkey_patch # apply monkey patch
  File "/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 8, in <module>
    replace_peft_model_with_int4_lora_model()
  File "/alpaca_lora_4bit/text-generation-webui/monkeypatch/peft_tuners_lora_monkey_patch.py", line 4, in replace_peft_model_with_int4_lora_model
    from ..models import GPTQLoraModel
ImportError: attempted relative import beyond top-level package
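The "attempted relative import beyond top-level package" error above is what Python raises when a module uses a relative import like `from ..models import ...` but isn't being loaded as part of a deep-enough package (for example, when the monkeypatch directory is used as a loose script folder). A minimal sketch of the usual fix, assuming the pip-installed package exposes its models module under the package name (the exact import path is an assumption):

```python
# monkeypatch/peft_tuners_lora_monkey_patch.py -- illustrative fix only.
def replace_peft_model_with_int4_lora_model():
    # `from ..models import GPTQLoraModel` climbs out of the monkeypatch
    # directory and fails when that parent is not itself a package.
    # Importing through the installed package name sidesteps the problem
    # (assuming the pip package exposes this path):
    from alpaca_lora_4bit.models import GPTQLoraModel
    # ... rest of the patching logic unchanged ...
```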
nealchandra commented 1 year ago

I believe this branch is missing this commit, https://github.com/johnsmith0031/alpaca_lora_4bit/commit/94851cec68fd336784feebdae1f99adc9b4b9d3b, which at least for me causes a breaking error during the build.

I'm curious about the vision for this project: is the intent primarily to support folks who just want an easy way to run text-generation-webui with 4-bit quantization? This seems like the case to me (for instance, the inference.py code does not actually apply a LoRA; the best example of inference is actually in the webui monkeypatch).

I think it is useful if that is the case, but for me this project would be even more valuable if it moved in the direction of this PR -- e.g. creating a core library which supports running inference and finetunes against multiple model types. That abstraction would make it easy to plug this into the webui, an API wrapper, or some other Python project directly. It seems hard to accomplish that goal without at least merging this PR back into the trunk.

tensiondriven commented 1 year ago

> This seems like the case to me

I am using it for a different purpose: to run local 4-bit training via scripts in an automated and repeatable fashion. It's important to me that I be able to run it separately from text-generation-webui, so I'd hate to lose that functionality.

tensiondriven commented 1 year ago

> creating a core library which supports running inference and finetunes against multiple model types

I'm sure @johnsmith0031 would know better than me, but I expect this project's functionality will eventually be exposed in Hugging Face Transformers or other large packages. This project is very cutting-edge and does things that haven't previously been possible. I like where your intention is, but I wouldn't want this project to become so formalized that it loses the agility needed to support features that are sometimes only a few days old.

urbien commented 1 year ago

@johnsmith0031 have you seen the LocalAI project? It creates an OpenAI-compatible server / API wrapper and supports multiple models simultaneously. I want to use it with my own open source web/mobile app, so it fits well, but it is designed for CPU-based execution around the GGML library, which, while awesome, is too slow and not really viable for 30B models. So this project, with LoRAs + 4-bit + flash-attention optimizations to serve 30B models from a single 3090-level GPU, would be just heaven! But I had trouble getting it running, let alone starting fine-tuning on my own data (I have personal datasets I want to train my own LoRAs on, and I'd like to experiment with multiple different LoRAs on top of the base model). I am a newbie in deep learning, so I might be missing things. In any case, thank you so much for putting this together.

johnsmith0031 commented 1 year ago

Thanks! Currently the hosting mode is compatible with text-generation-webui, which has better inference performance. Feel free to give it a try!