oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Support for LLaMA models #147

Closed ye7iaserag closed 1 year ago

ye7iaserag commented 1 year ago

Meta just released their LLaMA model family (https://github.com/facebookresearch/llama). Can we get support for it? They claim that the 13B model is better than the 175B GPT-3 model.


This is how to use the model in the web UI:

LLaMA TUTORIAL

-- oobabooga, March 9th, 2023

oobabooga commented 1 year ago

The models are not public yet, unfortunately. You have to request access.

MetaIX commented 1 year ago

Psst. Somebody leaked them https://twitter.com/Teknium1/status/1631322496388722689

Sumanai commented 1 year ago

Of course, the weights themselves are closed. But the code from the repository should be enough to add support. And where the end users will download the weights from is their problem.

catboxanon commented 1 year ago

https://github.com/facebookresearch/llama/pull/73

oobabooga commented 1 year ago

Done https://github.com/oobabooga/text-generation-webui/commit/ea5c5eb3daa5d3f319f4a6dbc6d02b7f993d1881

  1. Install LLaMa as in their README:
conda activate textgen
git clone https://github.com/facebookresearch/llama
cd llama
pip install -r requirements.txt
pip install -e .
  2. Put the model that you downloaded using your academic credentials in models/LLaMA-7B (the folder name must start with llama)

  3. Put a copy of these files inside that folder too: tokenizer.model and tokenizer_checklist.chk

  4. Start the web UI. I have tested with

python server.py --no-stream --model LLaMA-7B
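
The folder layout described in the steps above can be double-checked with a short script. This is only a sketch: `check_llama_folder` is a hypothetical helper, not part of the web UI, and treating the "must start with llama" rule as case-insensitive is an assumption based on the `LLaMA-7B` example.

```python
from pathlib import Path

# Tokenizer files named in the steps above.
REQUIRED_FILES = ("tokenizer.model", "tokenizer_checklist.chk")


def check_llama_folder(folder):
    """Return a list of problems found in a models/<name> folder (empty if none)."""
    folder = Path(folder)
    problems = []
    # Assumption: the "must start with llama" rule ignores case.
    if not folder.name.lower().startswith("llama"):
        problems.append(f"folder name {folder.name!r} does not start with 'llama'")
    for name in REQUIRED_FILES:
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    return problems


if __name__ == "__main__":
    for problem in check_llama_folder("models/LLaMA-7B"):
        print(problem)
```

An empty result only means the names look right; it says nothing about whether the weights themselves are complete.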


moorehousew commented 1 year ago

Getting a CUDA out-of-memory error- I assume lowmem support isn't included yet?

oobabooga commented 1 year ago

This isn't part of Hugging Face yet, so it doesn't have access to 8bit and CPU offloading.

The 7B model uses 14963MiB VRAM on my machine. Reducing the max_seq_len parameter from 2048 to 512 makes this go down to 13843MiB.
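
A rough back-of-the-envelope estimate suggests why a shorter max_seq_len saves memory. The sketch below assumes that a key/value cache is preallocated for max_seq_len positions in every layer, that values are stored in fp16 (2 bytes), that the batch size is 1, and that the 7B model has 32 layers with hidden size 4096. `kv_cache_mib` is a hypothetical helper, and the batch size in particular is a guess.

```python
def kv_cache_mib(n_layers, dim, max_seq_len, batch_size=1, bytes_per_value=2):
    """Size in MiB of a preallocated key/value cache."""
    # Two tensors (keys and values) per layer, each batch_size x max_seq_len x dim.
    n_values = 2 * n_layers * batch_size * max_seq_len * dim
    return n_values * bytes_per_value / 2**20


# Assumed 7B configuration: 32 layers, hidden size 4096, fp16 values.
full = kv_cache_mib(32, 4096, 2048)   # 1024.0 MiB
short = kv_cache_mib(32, 4096, 512)   # 256.0 MiB
print(f"estimated saving: {full - short:.0f} MiB")
```

Under these assumptions the estimate is 768 MiB, the same order of magnitude as the measured drop of 1120 MiB (14963 - 13843). The gap would have to come from other buffers that grow with the sequence length or from a different batch size.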

musicurgy commented 1 year ago

I get a bunch of dependency errors when launching despite setting up LLaMa beforehand (definitely my own fault and probably because of a messed up conda environment)

ModuleNotFoundError: No module named 'fire'
ModuleNotFoundError: No module named 'fairscale'

etc. Any chance you could include these in the default webui requirements assuming they aren't too heavy?
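
To see which dependencies are missing from the environment that actually launches the web UI, something like the following can be run with the same interpreter. It is a minimal sketch: `missing_modules` is a hypothetical helper, and the module list contains only the two names from the errors above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'fire' and 'fairscale' are the two modules named in the errors above.
    print(missing_modules(["fire", "fairscale"]))
```

If the list is non-empty after running pip install in the llama folder, the install most likely went into a different conda environment than the one used to start server.py.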

oobabooga commented 1 year ago

@musicurgy did you try pip install -r requirements.txt as in https://github.com/oobabooga/text-generation-webui/issues/147#issuecomment-1453880733?

musicurgy commented 1 year ago

Yeah, after a bit of a struggle I ended up getting it working by just copying all the dependencies into the webui folder. So far the model is really interesting. Thanks for supporting it.

generic-username0718 commented 1 year ago

Awesome stuff. I'm able to load LLaMA-7b but trying to load LLaMA-13b crashes with the error:

Traceback (most recent call last):
  File "/home/user/Documents/oobabooga/text-generation-webui/server.py", line 189, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/LLaMA.py", line 44, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=2 but world size is 1

generic-username0718 commented 1 year ago

Anyone reading this you can get past the issue above by changing the world_size variable found in modules/LLaMA.py like this:

def setup_model_parallel() -> Tuple[int, int]:
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = 2

My issue now is I'm running out of VRAM. I'm running dual 3090s and should be able to load the model if it's split among the cards...
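
The assertion in the traceback compares the world size with the number of checkpoint files. A small sketch of that check, assuming the shards are the `*.pth` files in the model folder; `count_shards` and `check_world_size` are hypothetical helpers, not functions from the repository:

```python
from pathlib import Path


def count_shards(ckpt_dir):
    """Count the *.pth checkpoint shards in a model folder."""
    return len(sorted(Path(ckpt_dir).glob("*.pth")))


def check_world_size(ckpt_dir, world_size):
    """Return an explanatory message if world_size does not match the shard count."""
    shards = count_shards(ckpt_dir)
    if shards != world_size:
        return f"checkpoint has {shards} shard(s) (MP={shards}) but world size is {world_size}"
    return None


if __name__ == "__main__":
    print(check_world_size("models/LLaMA-13B", world_size=1))
```

A matching count only satisfies the assertion. As the later comments in this thread show, it does not by itself make the 13B model load.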

generic-username0718 commented 1 year ago

Is there a parameter I need to pass to oobabooga to tell it to split the model among my two 3090 gpus?

Morb0 commented 1 year ago

Is there a parameter I need to pass to oobabooga to tell it to split the model among my two 3090 gpus?

Try --gpu-memory 10 5, at least that's what the README says.

generic-username0718 commented 1 year ago

Sorry super dumb but do I pass this to start-webui.sh? Like

sh start-webui.sh --gpu-memory 10 5?

Morb0 commented 1 year ago

Sorry super dumb but do I pass this to start-webui.sh? Like

sh start-webui.sh --gpu-memory 10 5?

Ah, that should work, but if not, edit the file and add it at the end of the line call python server.py --auto-devices --cai-chat

generic-username0718 commented 1 year ago

Thanks friend! I was able to get it with call python server.py --gpu-memory 20 20 --cai-chat

oobabooga commented 1 year ago

--gpu-memory should have no effect on LLaMA. This is for models loaded using the from_pretrained function from HF.

For LLaMA, the correct way is to change the global variables inside LLaMA.py like @generic-username0718 did, but I am not very familiar with the parameters yet.
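
For models that do go through the Hugging Face from_pretrained function, per-device limits are passed as a max_memory dictionary together with device_map="auto". The sketch below shows how values such as `10 5` could be turned into that dictionary. How the web UI actually translates the flag is an assumption here, and `build_max_memory` is a hypothetical helper.

```python
def build_max_memory(gib_per_gpu):
    """Map per-GPU limits in GiB to a Hugging Face-style max_memory dict."""
    return {index: f"{gib}GiB" for index, gib in enumerate(gib_per_gpu)}


# Values from the --gpu-memory 10 5 example above.
print(build_max_memory([10, 5]))  # {0: '10GiB', 1: '5GiB'}
```

Because the LLaMA loader in this thread does not use from_pretrained, a dictionary like this never reaches it, which is consistent with the flag having no effect.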

generic-username0718 commented 1 year ago

--gpu-memory should have no effect on LLaMA. This is for models loaded using the from_pretrained function from HF.

For LLaMA, the correct way is to change the global variables inside LLaMA.py like @generic-username0718 did, but I am not very familiar with the parameters yet.

I was starting to question my sanity... I think I accidentally was loading opt-13b instead... Sorry if I got people's hopes up

I'm still trying to split the model

Edit: Looks like they've already asked this here: https://github.com/facebookresearch/llama/issues/88

USBhost commented 1 year ago

bad news for the guys hoping to run 13B

Loading LLaMA-13B...
[W ProcessGroupGloo.cpp:694] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
  File "/UI/text-generation-webui/server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/UI/text-generation-webui/modules/models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "/UI/text-generation-webui/modules/LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "/UI/text-generation-webui/modules/LLaMA.py", line 61, in load
    model.load_state_dict(checkpoint, strict=False)
  File "/UI/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
        size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32000, 2560]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
        size mismatch for layers.0.attention.wq.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wk.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wv.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wo.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.feed_forward.w1.weight: copying a param with shape torch.Size([6912, 5120]) from checkpoint, the shape in current model is torch.Size([13824, 5120]).
        size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([5120, 6912]) from checkpoint, the shape in current model is torch.Size([5120, 13824]).
        size mismatch for layers.0.feed_forward.w3.weight: copying a param with shape torch.Size([6912, 5120]) from checkpoint, the shape in current model is torch.Size([13824, 5120]).
        size mismatch for layers.1.attention.wq.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        etc........

oobabooga commented 1 year ago

Did you set MP to '2' here?

https://github.com/oobabooga/text-generation-webui/blob/main/modules/LLaMA.py#L18

See

https://github.com/facebookresearch/llama#inference
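
The shapes in the traceback above are consistent with a checkpoint that was split in two: every mismatched dimension in the checkpoint is exactly half of what the model expects (2560 vs 5120, 6912 vs 13824). A minimal sketch of that arithmetic, where `shard_shape` is a hypothetical helper and the split dimension for each weight is simply read off the traceback:

```python
def shard_shape(full_shape, split_dim, mp):
    """Shape of one shard when `full_shape` is split evenly along `split_dim`."""
    if full_shape[split_dim] % mp != 0:
        raise ValueError("dimension is not divisible by the model-parallel size")
    shape = list(full_shape)
    shape[split_dim] //= mp
    return tuple(shape)


# Full-size shapes the model expects, with the split dimension read off the traceback.
print(shard_shape((5120, 5120), 0, 2))    # attention.wq -> (2560, 5120)
print(shard_shape((5120, 5120), 1, 2))    # attention.wo -> (5120, 2560)
print(shard_shape((13824, 5120), 0, 2))   # feed_forward.w1 -> (6912, 5120)
```

In other words, the model appears to have been built at full size, as for a single process, while each checkpoint file holds only half of every split weight.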

MarkSchmidty commented 1 year ago

LLaMA-7B can be run on CPU instead of GPU using this fork of the LLaMA repo: https://github.com/markasoftware/llama-cpu

To quote the author "On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!"

USBhost commented 1 year ago

Did you set MP to '2' here?

https://github.com/oobabooga/text-generation-webui/blob/main/modules/LLaMA.py#L18

See

https://github.com/facebookresearch/llama#inference

from llama import LLaMA, ModelArgs, Tokenizer, Transformer

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['MP'] = '2'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '2223'

def setup_model_parallel() -> Tuple[int, int]:
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = 2

    torch.distributed.init_process_group("gloo")
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)

    # seed must be the same in all processes
    torch.manual_seed(1)
    return local_rank, world_size

I sure did. Also, those os.environ settings don't seem to work. 7B loads fine. PS: my GPU is an A6000.

TheZennou commented 1 year ago

I also get the same error, with 13b.

hdelattre commented 1 year ago

Anyone else getting really poor results on 7B? I've tried many prompts and parameter variations and it generally ends up as mostly nonsense with lots of repetition. It might just be the model but I saw some 7B output examples posted online that seemed way better than anything I was getting.

BarsMonster commented 1 year ago

Is it possible to reduce computation precision on CPU? Down to 8 bit?

Manimap commented 1 year ago

Someone made a fork of llama github that apparently runs in 8bit : https://github.com/tloen/llama-int8

Zero idea if it works or anything.

hopto-dot commented 1 year ago

I'm getting the following error when trying to run the 7B model on my rtx 3090, can someone help?

C:\Users\Username\Documents\Git\text-generation-webui>python server.py --listen --no-stream --model LLaMA-7B
Loading LLaMA-7B...
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [3ca52znvmj.adobe.io]:2223 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [3ca52znvmj.adobe.io]:2223 (system error: 10049 - The requested address is not valid in its context.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
  File "C:\Users\Username\Documents\Git\text-generation-webui\server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\LLaMA.py", line 58, in load
    torch.set_default_tensor_type(torch.cuda.HalfTensor)
  File "C:\Users\Username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\__init__.py", line 348, in set_default_tensor_type
    _C._set_default_tensor_type(t)
TypeError: type torch.cuda.HalfTensor not available. Torch not compiled with CUDA enabled.

hdelattre commented 1 year ago

@hopto-dot Go here and run the pip command for the 11.7 build on your OS: https://pytorch.org/get-started/locally/
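
Before and after reinstalling, it can help to confirm what the installed PyTorch build supports. A minimal sketch using `torch.version.cuda` and `torch.cuda.is_available()`; `cuda_status` is a hypothetical helper and the messages are only illustrative:

```python
def cuda_status():
    """Describe whether the installed PyTorch build can use CUDA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.version.cuda is None:
        return f"PyTorch {torch.__version__} has no CUDA support (for example a CPU-only build)"
    if not torch.cuda.is_available():
        return f"PyTorch was built for CUDA {torch.version.cuda}, but no usable GPU was found"
    return f"CUDA {torch.version.cuda} is available on {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

The error above corresponds to the "no CUDA support" case, where no GPU tensor type such as torch.cuda.HalfTensor is available.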

hopto-dot commented 1 year ago

Thank you, I'll try that

jangofett890 commented 1 year ago

If anyone else is having issues installing the requirements for LLaMA in the conda environment because you used the installer, I made a bat file that sets the conda environment and installs the requirements, based on the one in the folder:

@echo off
@echo Loading the local Conda environment and getting LLaMA and its requirements...
set INSTALL_ENV_DIR=%cd%\installer_files\env
set PATH=%INSTALL_ENV_DIR%;%INSTALL_ENV_DIR%\Library\bin;%INSTALL_ENV_DIR%\Scripts;%INSTALL_ENV_DIR%\Library\usr\bin;%PATH%
call conda activate
git clone https://github.com/facebookresearch/llama
cd llama
pip install -r requirements.txt
pip install -e .
pause

neuhaus commented 1 year ago

Someone made a fork of llama github that apparently runs in 8bit : https://github.com/tloen/llama-int8 Zero idea if it works or anything.

Great stuff, Llama-13B INT8 works for me on a RTX 3090! @oobabooga could you give it a try to get the Llama-INT8 changes incorporated?

wywywywy commented 1 year ago

Someone made a fork of llama github that apparently runs in 8bit : https://github.com/tloen/llama-int8 Zero idea if it works or anything.

Great stuff, Llama-13B INT8 works for me on a RTX 3090! @oobabooga could you give it a try to get the Llama-INT8 changes incorporated?

It's worth mentioning that llama-int8 uses bitsandbytes which only works on Linux, not Windows, as far as I know. It may work in WSL though?

USBhost commented 1 year ago

https://github.com/facebookresearch/llama-recipes/issues/172 last comment explains how to load 13B on one GPU

oobabooga commented 1 year ago

The 8-bit version is very exciting, but it needs to have an optional load_in_8bit=True flag while loading the model to let users choose whether they want 8-bit or not without having to reinstall different versions of the llama library multiple times.
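
One way such an optional flag could select the implementation at load time is sketched below. This is an illustration under assumptions, not the actual change: `pick_backend` is hypothetical, and the two module names are the ones that appear in tracebacks elsewhere in this thread.

```python
import argparse


def pick_backend(argv):
    """Return the module to import depending on an optional --load-in-8bit flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-in-8bit", action="store_true",
                        help="load the model with 8-bit weights")
    # Ignore the other server.py arguments in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return "modules.LLaMA_8bit" if args.load_in_8bit else "modules.LLaMA"


print(pick_backend(["--model", "LLaMA-13B", "--load-in-8bit"]))
print(pick_backend(["--model", "LLaMA-13B"]))
```

Importing the chosen module only when it is needed keeps the 8-bit dependencies optional for users who never pass the flag.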

happyme531 commented 1 year ago

Int8 fork works great! Looking for webui support...

oobabooga commented 1 year ago

LLaMA 8-bit is now implemented: https://github.com/oobabooga/text-generation-webui/commit/bd8aac8fa43daa7bd0e2d3d2e446a403a447c744

To use it:

  1. Clone the llama-int8 repository inside text-generation-webui/repositories/llama_int8 (it is important to rename the - to an underscore _):
cd text-generation-webui
mkdir repositories
cd repositories
git clone https://github.com/tloen/llama-int8
mv llama-int8 llama_int8
pip install -r llama_int8/requirements.txt
  2. Start the web UI with the --load-in-8bit flag:
python server.py --model LLaMA-13B --load-in-8bit

Make sure to put a copy of the tokenizer files inside models/LLaMA-13B exactly as we did with LLaMA-7B above.
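
The rename from llama-int8 to llama_int8 matters because the folder name ends up inside a Python import statement, and a hyphen is not valid there. A minimal sketch of both checks; `importable_dir_name` and `check_repository` are hypothetical helpers:

```python
from pathlib import Path


def importable_dir_name(name):
    """True if a directory name can be used as a Python package name in an import."""
    return name.isidentifier()


def check_repository(webui_root):
    """Return a problem description, or None if repositories/llama_int8 exists."""
    repo = Path(webui_root) / "repositories" / "llama_int8"
    if not repo.is_dir():
        return f"{repo} does not exist"
    return None


print(importable_dir_name("llama-int8"))   # False: '-' is not allowed in identifiers
print(importable_dir_name("llama_int8"))   # True
```

`check_repository("text-generation-webui")` can be called from the parent directory to confirm that the clone landed in the expected place.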

The GPU memory usages in 8-bit mode are the following:

13b

perkel666 commented 1 year ago

@oobabooga

I downloaded the latest git zip, ran the installer, followed https://github.com/oobabooga/text-generation-webui/issues/147#issuecomment-1454798725 and put the model in models.

error:

Traceback (most recent call last):
  File "C:\AI\oobabooga\text-generation-webui\server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\AI\oobabooga\text-generation-webui\modules\models.py", line 92, in load_model
    import modules.LLaMA_8bit
  File "C:\AI\oobabooga\text-generation-webui\modules\LLaMA_8bit.py", line 8, in <module>
    import fire
ModuleNotFoundError: No module named 'fire'

It seems like a dependency issue. fire is a module you can install via pip install fire, but it needs to be added to requirements.txt or installed manually in the textgenui environment.

ewof commented 1 year ago

zsh kills my process everytime i try to load llama-7B with int8 on a 3060

python server.py --listen --model llama-7B --load-in-8bit
Loading llama-7B...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
Creating transformer
Transformer created
Loading checkpoint 0
zsh: killed     python server.py --listen --model llama-7B --load-in-8bit

GreenGarnets commented 1 year ago

There seems to be an issue with bitsandbytes not finding CUDA, I'm getting the same error, libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats.

https://github.com/TimDettmers/bitsandbytes/issues/175 https://github.com/TimDettmers/bitsandbytes/issues/176

oobabooga commented 1 year ago

@perkel666 I forgot to add that you need to install the requirements:

pip install -r llama_int8/requirements.txt

@GreenGarnets Bitsandbytes doesn't work properly on Windows yet. See: here and here for a workaround.

@elwolf6 check your RAM usage, maybe the process is being killed because too much RAM is being used. Increase your swap size if needed.
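
On Linux, the memory that is still available (RAM plus swap) can be read from /proc/meminfo just before loading the model. A minimal sketch, with `available_kib` as a hypothetical helper; it does not apply to Windows or macOS:

```python
import os


def available_kib(meminfo_text):
    """Sum MemAvailable and SwapFree (in KiB) from the text of /proc/meminfo."""
    wanted = {"MemAvailable", "SwapFree"}
    total = 0
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            total += int(rest.split()[0])  # the value is followed by the unit "kB"
    return total


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(f"{available_kib(f.read()) / 2**20:.1f} GiB available (RAM + swap)")
```

If this figure is smaller than the checkpoint files on disk, a process killed while "Loading checkpoint 0" is a plausible result.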

Loufe commented 1 year ago

Thanks @GreenGarnets, I started installing them manually. Is nobody else getting this error once those dependencies are installed?

Starting the web UI...
Loading LLaMA-7B...
Traceback (most recent call last):
  File "D:\text-generation-webui\text-generation-webui\server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\text-generation-webui\text-generation-webui\modules\models.py", line 92, in load_model
    import modules.LLaMA_8bit
  File "D:\text-generation-webui\text-generation-webui\modules\LLaMA_8bit.py", line 16, in <module>
    from repositories.llama_int8.llama import ModelArgs, Transformer, Tokenizer, LLaMA
ModuleNotFoundError: No module named 'repositories'
Press any key to continue . . .

YakuzaSuske commented 1 year ago

I get this:

CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
https://github.com/TimDettmers/bitsandbytes/issues
Press any key to continue . . .

MetaIX commented 1 year ago

Followed instructions, tested on windows 10, everything seems to be working. For those who might miss this, if you are on windows, make sure to use the correct version of bitsandbytes. https://github.com/oobabooga/text-generation-webui/issues/20#issuecomment-1411650652

GreenGarnets commented 1 year ago

You should have llama-int8 in your folder "repositories".

Loufe commented 1 year ago

image

It's there, with the correct rename. Am I missing something here? Why is python supposed to recognize the "repositories" folder as an object?
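
On the question of why Python should recognize the folder: an import such as `repositories.llama_int8.llama` is resolved by looking for a directory named repositories in the entries of sys.path, and when a script is started as `python server.py`, the directory containing server.py is the first entry. The sketch below reproduces this with temporary directories; `can_import_from` is a hypothetical helper.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import_from(root, dotted_name):
    """Return True if `dotted_name` is importable when `root` is on sys.path."""
    root = str(root)
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        importlib.import_module(dotted_name)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        sys.path.remove(root)
        # Forget what was imported so that later checks start from scratch.
        top = dotted_name.split(".")[0]
        for name in [m for m in sys.modules if m == top or m.startswith(top + ".")]:
            del sys.modules[name]


with tempfile.TemporaryDirectory() as good, tempfile.TemporaryDirectory() as bad:
    # Recreate the expected layout under `good` only.
    package = Path(good) / "repositories" / "llama_int8" / "llama"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    print(can_import_from(good, "repositories.llama_int8.llama"))  # True
    print(can_import_from(bad, "repositories.llama_int8.llama"))   # False
```

One possible cause of the error above is therefore that the repositories folder sits in the outer text-generation-webui folder rather than next to server.py in the inner one.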

oobabooga commented 1 year ago

The biggest bottleneck now is that only temperature and top_p are being used. The quality of the outputs would become a lot better if repetition_penalty and top_k were added.

This could be adapted: https://rentry.org/llama_few_more_samplers
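
For readers unfamiliar with the two samplers, here is a minimal pure-Python sketch of what they do to a list of logits. It follows the common formulation (divide positive logits and multiply negative ones for repeated tokens, then keep the k largest) and is not the code from the link above; both function names are hypothetical.

```python
import math


def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for token in set(previous_tokens):
        # Dividing positive logits and multiplying negative ones both lower the score.
        out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
    return out


def top_k_filter(logits, k):
    """Keep the k highest logits and set the rest to -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, previous_tokens=[0, 1], penalty=2.0)
print(penalized)                   # [1.0, -2.0, 0.5, 3.0]
print(top_k_filter(penalized, 2))  # [1.0, -inf, -inf, 3.0]
```

In a real generation loop these steps would run on the model's logits tensor before temperature and top_p are applied and a token is sampled.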

perkel666 commented 1 year ago

@perkel666 I forgot to add that you need to install the requirements:

pip install -r llama_int8/requirements.txt

@GreenGarnets Bitsandbytes doesn't work properly on Windows yet. See: here and here for a workaround.

@elwolf6 check your RAM usage, maybe the process is being killed because too much RAM is being used. Increase your swap size if needed.

Thanks. Further problem lol

it loads just fine but it spews out

This site can’t be reached The webpage at http://0.0.0.0:7860/ might be temporarily down or it may have moved permanently to a new web address.

for some reason it set ip to 0.0.0.0:7860:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

Creating transformer
Transformer created
Loading checkpoint 0
Loading checkpoint 1
Quantizing 281 layers
100%|█████████████████████████████████████████| 281/281 [00:46<00:00, 6.04it/s]
Loaded in 175.92 seconds
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().

oobabooga commented 1 year ago

Remove --listen
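
Some background on why this helps: 0.0.0.0 is a wildcard address that tells the server to listen on all network interfaces, and browsers generally cannot open it as a destination. As an alternative to removing the flag, the same port can usually be opened through the loopback address on the same machine. A minimal sketch of that substitution, where `browsable_url` is a hypothetical helper:

```python
from urllib.parse import urlsplit, urlunsplit


def browsable_url(url):
    """Replace the wildcard host 0.0.0.0 with the loopback address."""
    parts = urlsplit(url)
    if parts.hostname != "0.0.0.0":
        return url
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(browsable_url("http://0.0.0.0:7860"))  # http://127.0.0.1:7860
```

From another machine on the same network, the server's LAN address would be used in place of 127.0.0.1.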

ewof commented 1 year ago

@perkel666 I forgot to add that you need to install the requirements:

pip install -r llama_int8/requirements.txt

@GreenGarnets Bitsandbytes doesn't work properly on Windows yet. See: here and here for a workaround.

@elwolf6 check your RAM usage, maybe the process is being killed because too much RAM is being used. Increase your swap size if needed.

ye i got 27gb ram free 12gb vram free still kills process idk