huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Please add support for neural-chat-7b-v3-1 #1284

Closed odellus closed 11 months ago

odellus commented 1 year ago

Model description

I'm using neural-chat-7b-v3-1 locally on my laptop and it would sure be sweet if I could serve it through tgi.

I can currently use it with Python using this pattern:

import torch
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

fpath = '/home/thomas/src/neural-chat-7b-v3-1'

tokenizer = AutoTokenizer.from_pretrained(fpath)
model = AutoModelForCausalLM.from_pretrained(
    fpath, 
    device_map = 'auto',
    quantization_config = quantization_config,
)
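For reference, generating with the model loaded this way looks roughly like the following (the prompt format here is just an illustrative example):

prompt = '### System:\nYou are a helpful assistant.\n### User:\nHello, who are you?\n### Assistant:\n'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
# greedy decoding, up to 256 new tokens
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))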

but when I try to pass the path of the repo I cloned through to tgi I get

(text-generation-inference) thomas@computer-1:~/src/notes/projects/assistant/backend/text-generation-inference$ target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
2023-11-24T15:56:31.756672Z  INFO text_generation_launcher: Args { model_id: "/home/thomas/src/neural-chat-7b-v3-1", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-11-24T15:56:31.756747Z  INFO download: text_generation_launcher: Starting download process.
2023-11-24T15:56:33.786649Z  INFO text_generation_launcher: Peft model detected.

2023-11-24T15:56:33.786688Z  INFO text_generation_launcher: Loading the model it might take a while without feedback

2023-11-24T15:56:34.161062Z ERROR download: text_generation_launcher: Download encountered an error: Traceback (most recent call last):

  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/utils/config.py", line 117, in from_pretrained
    config_file = hf_hub_download(

  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)

  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/thomas/src/neural-chat-7b-v3-1'. Use `repo_type` argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/peft.py", line 16, in download_and_unload_peft
    model = AutoPeftModelForCausalLM.from_pretrained(

  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/auto.py", line 69, in from_pretrained
    peft_config = PeftConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)

  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/utils/config.py", line 121, in from_pretrained
    raise ValueError(f"Can't find '{CONFIG_NAME}' at '{pretrained_model_name_or_path}'")

ValueError: Can't find 'adapter_config.json' at '/home/thomas/src/neural-chat-7b-v3-1'

So I'm seeing an error that appears to be related to #1283, in addition to tgi complaining there's no adapter_config.json, which is odd because the repo has the full model and is not a PEFT adapter. But it doesn't even look like it can see the local repo, so I don't know.

Open source status

  • [x] The model implementation is available
  • [x] The model weights are available

Provide useful links for the implementation

https://huggingface.co/Intel/neural-chat-7b-v3-1

odellus commented 1 year ago

This is a Mistral model, so it should be supported, correct?

odellus commented 1 year ago

I was able to get around the error where text-generation-launcher tries to read the path as if it were a PEFT LoRA model by commenting out the following lines from server/text_generation_server/cli.py. I'd personally try to load a local path as normal weights before treating it like a PEFT model (rough sketch of what I mean below), but what do I know. https://github.com/huggingface/text-generation-inference/blob/3c02262f29fba3bf1096f22017f70d59ff4daa4d/server/text_generation_server/cli.py#L153-L163
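What I mean is roughly a guard like this ahead of the PEFT branch (a sketch against a hypothetical model_id variable, not the actual cli.py code):

import os

# a local directory without an adapter_config.json is full weights, not a PEFT adapter
looks_like_peft = os.path.isfile(os.path.join(model_id, 'adapter_config.json'))
if os.path.isdir(model_id) and not looks_like_peft:
    ...  # load the weights directly
else:
    ...  # fall back to the existing download_and_unload_peft path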

Currently stuck installing the dropout-layer-norm package. I had to cd server && make install-vllm && make install-flash-attention-v2, which does not actually install dropout-layer-norm. To get that I had to follow the instructions found here. But I'm having issues, because when I try to install xformers for vllm it downgrades transformers to 4.33 and CUDA to 11.7, which breaks the flash-attention build I compiled with pytorch-2.1 and cuda-12.1. So I'm basically in dependency hell and could use a clear set of steps to actually run Mistral models on TGI, please.

odellus commented 1 year ago

I was able to build dropout-layer-norm after creating a fresh environment. Steps I've taken:

# mise en place
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference

# Create a new environment
conda create -n text-generation-inference python=3.10
conda activate text-generation-inference

# build tgi server
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# This fails for me the first time because nvcc is not installed

# So we install nvcc
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit

# Install it again
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# that should work now that nvcc is installed

# build flash attention v2
cd server
make install-flash-attention-v2
# This will install ninja so set MAX_JOBS
export MAX_JOBS=2

# now build dropout-layer-norm
cd flash-attention-v2/csrc/layer_norm
python -m pip install .

# okay figured it would be a good time to try it out
cd ../../../..

# we should be back in text-generation-inference root directory now
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4

and I'm now seeing

2023-11-25T10:55:06.159659Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'cache_ops' from 'vllm' (unknown location)

2023-11-25T10:55:06.163638Z  WARN text_generation_launcher: Could not import Mistral model: cannot import name 'cache_ops' from 'vllm' (unknown location)

2023-11-25T10:55:06.242592Z ERROR text_generation_launcher: Error when initializing model

so it appears I cannot get away with skipping make install-vllm, even though that's what downgraded CUDA and PyTorch last time. Is there a newer commit for vllm than this one?

https://github.com/huggingface/text-generation-inference/blob/3c02262f29fba3bf1096f22017f70d59ff4daa4d/server/Makefile-vllm#L1-L2

Because this installed every nvidia-*-cu11 package in my conda environment and had me getting errors like

/home/thomas/miniconda3/envs/text-generation-inference/include/cuda_bf16.hpp:579:9: error: ‘__internal_device_bfloat162float’ was not declared in this scope; did you mean ‘__internal_bfloat162float’?
        579 |     f = __internal_device_bfloat162float(h);

odellus commented 1 year ago

Should I git checkout tags/v1.1.1 -b v1.1.1-branch instead of working from main?

odellus commented 1 year ago

So I was able to install vllm from the newer commit mentioned in #1285, and the only problem I'm seeing from pip is

text-generation-server 1.1.1 requires tokenizers<0.14.0,>=0.13.3, but you have tokenizers 0.14.1 which is incompatible.

but I figured I'd try it out anyway, so I go back to the repo root and run

target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4

and now I'm getting

2023-11-25T11:59:52.063396Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/layers.py)

so it looks like I need to

cd server/flash-attention-v2/csrc/rotary
python -m pip install .

After doing that, I'm now seeing

2023-11-25T12:05:24.429362Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  (lots of stuff you don't need to see)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/server.py", line 72, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 672, in warmup
    _, batch = self.generate_token(batch)
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 753, in generate_token
    raise e
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 750, in generate_token
    out = self.forward(batch)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_mistral.py", line 343, in forward
    logits = self.model.forward(
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 512, in forward
    hidden_states = self.model(
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 457, in forward
    hidden_states, residual = layer(
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 382, in forward
    attn_output = self.self_attn(
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 272, in forward
    paged_attention.reshape_and_cache(
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/paged_attention.py", line 12, in reshape_and_cache
    cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int

which is a new error and a sign of the forward march of progress!

odellus commented 1 year ago

seems to be a vllm interop issue, which makes sense as I'm working on main lol

odellus commented 1 year ago

I changed paged_attention.py to pass slots.type(torch.LongTensor) instead of the plain slots because of the above error; what I did is roughly the following (paraphrased, not the exact file contents):
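import torch
from vllm import cache_ops

# server/text_generation_server/utils/paged_attention.py (paraphrased sketch of my change)
def reshape_and_cache(key, value, key_cache, value_cache, slots):
    # the kernel complained about Int vs Long, so cast the slot indices to int64 in the call
    # (note: .type(torch.LongTensor) also moves slots to the CPU; slots.long() would keep it on the GPU)
    cache_ops.reshape_and_cache(
        key, value, key_cache, value_cache, slots.type(torch.LongTensor)
    )

With that change in place, I'm now getting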

  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/flash_attn.py", line 63, in attention
    return flash_attn_2_cuda.varlen_fwd(
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions."))

so I

export CUDA_LAUNCH_BLOCKING=1
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4

and I still see

File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 382, in forward
    attn_output = self.self_attn(
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 282, in forward
    flash_attn.attention(
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/flash_attn.py", line 63, in attention
    return flash_attn_2_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

which looks an awful lot like a problem with flash-attention's varlen_fwd rather than with text-generation-inference.

If I try to run

target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes
# no nf4

I get an OOM error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 7.79 GiB 
of which 127.38 MiB is free. Including non-PyTorch memory, this process has 7.56 GiB memory in use. Of the allocated 
memory 6.80 GiB is allocated by PyTorch, and 596.35 MiB is reserved by PyTorch but unallocated. If reserved but 
unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
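(For what it's worth, the allocator hint from that message can be set like this before relaunching; I haven't verified it helps here:)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes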

so I guess I need to go unit test varlen_fwd in the flash attention repo.

odellus commented 1 year ago

Okay so I just ran

cd server/flash-attention-v2/tests
pytest

and after installing the correct versions of torchvision and timm along with building the fused_dense_lib module, I'm getting only a single error

import file mismatch:
imported module 'test_rotary' has this __file__ attribute:
  /home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/flash-attention-v2/tests/test_rotary.py
which is not the same as the test file we want to collect:
  /home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/flash-attention-v2/tests/layers/test_rotary.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

which doesn't give me much help in tracking down why I'm getting illegal memory access errors from varlen_fwd when tgi warms up. How big of a batch is tgi using? Could this be a simple OOM that's getting obfuscated somehow?
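(The collection error itself is just pytest complaining about duplicate test module basenames; per its hint, clearing stale caches would be something like this, though it has nothing to do with the memory errors:)

cd server/flash-attention-v2
find . -name '__pycache__' -type d -prune -exec rm -rf {} +
find . -name '*.pyc' -delete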

odellus commented 1 year ago

I'm thinking this is my problem, as I am not using an A100:

https://github.com/huggingface/text-generation-inference/blob/3c02262f29fba3bf1096f22017f70d59ff4daa4d/router/client/src/client.rs#L98-L102

I set --max-batch-total-tokens 4096 (setting batch size to 1) and still see

========= Invalid __global__ read of size 8 bytes
=========     at 0x50 in void vllm::reshape_and_cache_kernel<c10::Half>(const T1 *, const T1 *, T1 *, T1 *, const long *, int, int, int, int, int, int)
=========     by thread (35,0,0) in block (0,0,0)
=========     Address 0x4b38cfc0 is out of bounds
=========     and is 140,009,470,898,240 bytes before the nearest allocation at 0x7f56ca000000 of size 20,971,520 bytes

I've fiddled around trying to get rid of the warmup step. I even took out the self.generate_token(batch) call in warmup in flash_causal_lm.py, since pulling warmup out of the Rust code entirely left me without an initialized cache manager. With that call removed I started getting OOMs from initializing the cache manager, and putting it back in brings back the errors above. So I guess I'm working on a low-resource warmup right now.

The thing is, I can use the model through Python and the flash attention unit tests are all passing.

After some extensive testing with the model in Python, I find I can only process sequences of around 2000 tokens on the 8 GB mobile GPU. Any more and I OOM.

If I set --max-batch-total-tokens 2048 and --max-batch-prefill-tokens 1024 (full launch command below) I still see the same illegal memory access, even though, if my investigations in Python tell me anything, I shouldn't be maxing out GPU memory.
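For reference, the launch I'm testing at this point looks roughly like this, combining the flags above with the earlier command:

target/release/text-generation-launcher \
    --model-id /home/thomas/src/neural-chat-7b-v3-1 \
    --port=8080 \
    --quantize bitsandbytes-nf4 \
    --max-batch-prefill-tokens 1024 \
    --max-batch-total-tokens 2048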

MiladMolazadeh commented 1 year ago

(quoting the original issue description above)

I encountered this exact error while running it from Docker.

odellus commented 1 year ago

(quoting my earlier comment about the PEFT detection in cli.py)

If you run it locally, you can comment out the offending code here: https://github.com/huggingface/text-generation-inference/blob/3c02262f29fba3bf1096f22017f70d59ff4daa4d/server/text_generation_server/cli.py#L153-L163

poojitharamachandra commented 1 year ago

Hi,

I run the command below and get an error:

text-generation-launcher --model-id ~/models/beluga --port=8080 --quantize bitsandbytes

RuntimeError: unable to mmap 9976578928 bytes from file </home/models/beluga/model-00001-of-00002.safetensors>: Cannot allocate memory (12)

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

rvsoni commented 2 months ago

Any update on this? I get the same error with the Intel/neural-chat-7b-v3-3 model.