This is a Mistral model, so it should be supported, correct?
I was able to get around the error where text-generation-launcher tries to read the path as if it were a PEFT LoRA model by commenting out the following lines from server/text_generation_server/cli.py. I'd personally try to load a local path as normal weights before treating it like a PEFT model, but what do I know.
https://github.com/huggingface/text-generation-inference/blob/3c02262f29fba3bf1096f22017f70d59ff4daa4d/server/text_generation_server/cli.py#L153-L163
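For what it's worth, the kind of guard I have in mind is something like this (a hypothetical sketch, not TGI's actual code): only take the PEFT path for a local directory if it actually ships an adapter_config.json, and otherwise load it as plain weights.

import os

def looks_like_peft_adapter(model_id: str) -> bool:
    # Hypothetical check: a local directory is only a PEFT adapter if it
    # contains adapter_config.json; a full model like neural-chat-7b-v3-1 does not.
    if os.path.isdir(model_id):
        return os.path.isfile(os.path.join(model_id, "adapter_config.json"))
    # For hub repo ids, fall back to whatever detection the launcher already does.
    return False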
Currently I'm stuck installing the dropout-layer-norm package. I had to run cd server && make install-vllm && make install-flash-attention-v2, which does not actually install dropout-layer-norm. To get it I had to follow the instructions found here. But I'm having issues, because when I try to install xformers for vllm it downgrades transformers to 4.33 and CUDA to 11.7, which breaks the flash-attention build I compiled against PyTorch 2.1 and CUDA 12.1. So I'm basically in dependency hell and could use a clear set of steps to actually run Mistral models on TGI, please.
I was able to build dropout-layer-norm after creating a fresh environment. Steps I've taken:
# mise en place
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
# Create a new environment
conda create -n text-generation-inference python=3.10
conda activate text-generation-inference
# build tgi server
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# This fails for me the first time because nvcc is not installed
# So we install nvcc
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
# Install it again
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# that should work now that nvcc is installed
# build flash attention v2
cd server
make install-flash-attention-v2
# The layer_norm build below uses ninja, so cap MAX_JOBS to keep the compile from exhausting RAM
export MAX_JOBS=2
# now build dropout-layer-norm
cd flash-attention-v2/csrc/layer_norm
python -m pip install .
# okay figured it would be a good time to try it out
cd ../../../..
# we should be back in text-generation-inference root directory now
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
and I'm now seeing
2023-11-25T10:55:06.159659Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'cache_ops' from 'vllm' (unknown location)
2023-11-25T10:55:06.163638Z WARN text_generation_launcher: Could not import Mistral model: cannot import name 'cache_ops' from 'vllm' (unknown location)
2023-11-25T10:55:06.242592Z ERROR text_generation_launcher: Error when initializing model
so it appears I cannot get away with skipping make install-vllm, even though last time that's what downgraded CUDA and PyTorch. Is there a newer commit for vllm than this? Because this one installed every nvidia-*-cu11 package into my conda environment and had me getting errors like
/home/thomas/miniconda3/envs/text-generation-inference/include/cuda_bf16.hpp:579:9: error: ‘__internal_device_bfloat162float’ was not declared in this scope; did you mean ‘__internal_bfloat162float’?
579 | f = __internal_device_bfloat162float(h);
Should I git checkout tags/v1.1.1 -b v1.1.1-branch
instead of working from main?
So I was able to install vllm from the newer commit mentioned in #1285, and the only problem I'm seeing from pip is
text-generation-server 1.1.1 requires tokenizers<0.14.0,>=0.13.3, but you have tokenizers 0.14.1 which is incompatible.
but I figured let's try it out anyway, so I go back to the repo root and run
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
and now I'm getting
2023-11-25T11:59:52.063396Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/layers.py)
so it looks like I need to
cd server/flash-attention-v2/csrc/rotary
python -m pip install .
After doing that, I'm now seeing
2023-11-25T12:05:24.429362Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
(lots of stuff you don't need to see)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/server.py", line 72, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 672, in warmup
_, batch = self.generate_token(batch)
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 753, in generate_token
raise e
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 750, in generate_token
out = self.forward(batch)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/flash_mistral.py", line 343, in forward
logits = self.model.forward(
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 512, in forward
hidden_states = self.model(
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 457, in forward
hidden_states, residual = layer(
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 382, in forward
attn_output = self.self_attn(
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 272, in forward
paged_attention.reshape_and_cache(
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/paged_attention.py", line 12, in reshape_and_cache
cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int
which is a new error and a sign of the forward march of progress! It seems to be a vllm interop issue, which makes sense since I'm working from main, lol.
Because of the above error I changed paged_attention.py to pass slots.type(torch.LongTensor) instead of plain slots; the change looks roughly like the sketch below.
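This is a sketch of the patched wrapper, not the exact diff; the real call site in text_generation_server/utils/paged_attention.py may differ slightly. One caveat worth flagging: slots.type(torch.LongTensor) produces a CPU tensor, whereas slots.long() or slots.to(torch.int64) would keep the slot mapping on the same CUDA device as the KV cache.

import torch
from vllm import cache_ops  # the vllm build pulled in by make install-vllm

def reshape_and_cache(key, value, key_cache, value_cache, slots):
    # Cast the slot mapping to int64, since the kernel complained
    # "expected scalar type Long but found Int".
    # NOTE: .type(torch.LongTensor) lands on the CPU; slots.long() would
    # keep the tensor on the GPU next to the caches.
    cache_ops.reshape_and_cache(
        key, value, key_cache, value_cache, slots.type(torch.LongTensor)
    )

With that in place, I'm now getting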
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/flash_attn.py", line 63, in attention
return flash_attn_2_cuda.varlen_fwd(
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions."))
so I
export CUDA_LAUNCH_BLOCKING=1
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
and I still see
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 382, in forward
attn_output = self.self_attn(
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thomas/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 282, in forward
flash_attn.attention(
File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/flash_attn.py", line 63, in attention
return flash_attn_2_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
which looks an awful lot like a problem with flash-attention's varlen_fwd
and not text-generation-inference
If I try to run
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes
# no nf4
I get an OOM error
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 7.79 GiB
of which 127.38 MiB is free. Including non-PyTorch memory, this process has 7.56 GiB memory in use. Of the allocated
memory 6.80 GiB is allocated by PyTorch, and 596.35 MiB is reserved by PyTorch but unallocated. If reserved but
unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
so I guess I need to go unit test varlen_fwd in the flash attention repo.
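Before going through the whole suite, a standalone smoke test of the varlen path through the public wrapper is roughly this (shapes are my guess at a Mistral-7B attention layer: 32 query heads, 8 KV heads, head dim 128):

import torch
from flash_attn import flash_attn_varlen_func  # public wrapper over varlen_fwd

torch.manual_seed(0)
seqlen, n_heads, n_kv_heads, head_dim = 1024, 32, 8, 128
q = torch.randn(seqlen, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(seqlen, n_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(seqlen, n_kv_heads, head_dim, dtype=torch.float16, device="cuda")
# A single 1024-token sequence, expressed as cumulative sequence lengths.
cu_seqlens = torch.tensor([0, seqlen], dtype=torch.int32, device="cuda")
out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, seqlen, seqlen, causal=True)
print(out.shape, out.dtype)  # expect torch.Size([1024, 32, 128]) torch.float16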
Okay so I just ran
cd server/flash-attention-v2/tests
pytest
and after installing the correct versions of torchvision and timm along with building the fused_dense_lib module, I'm getting only a single error
import file mismatch:
imported module 'test_rotary' has this __file__ attribute:
/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/flash-attention-v2/tests/test_rotary.py
which is not the same as the test file we want to collect:
/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/flash-attention-v2/tests/layers/test_rotary.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
which doesn't give me much help in tracking down why I'm getting illegal memory access errors from varlen_fwd when TGI warms up. How big a batch is TGI using? Could this be a simple OOM that's getting obfuscated somehow?
Thinking this is my problem since I'm not using an A100, I set --max-batch-total-tokens 4096 (effectively a batch size of 1) and still see
========= Invalid __global__ read of size 8 bytes
========= at 0x50 in void vllm::reshape_and_cache_kernel<c10::Half>(const T1 *, const T1 *, T1 *, T1 *, const long *, int, int, int, int, int, int)
========= by thread (35,0,0) in block (0,0,0)
========= Address 0x4b38cfc0 is out of bounds
========= and is 140,009,470,898,240 bytes before the nearest allocation at 0x7f56ca000000 of size 20,971,520 bytes
I've fiddled around trying to get rid of the warmup step. I even took out the self.generate_token(batch) call in warmup in flash_causal_lm.py, since pulling warmup out of the Rust code entirely left me without an initialized cache manager; then I started getting OOMs from initializing the cache manager. Putting it back in, I get the above errors. So I guess I'm working on a low-resource warmup right now.
The thing is, I can use the model through Python and the flash attention unit tests are all passing.
After some extensive testing with the model in Python, I find I can only process sequences of around 2000 tokens on the 8 GB mobile GPU. Any more and I OOM.
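Roughly the kind of probe I mean (a sketch, not the exact script I ran):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Find the longest prompt the 8 GB GPU can prefill with the NF4-quantized model.
fpath = "/home/thomas/src/neural-chat-7b-v3-1"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(fpath)
model = AutoModelForCausalLM.from_pretrained(
    fpath, device_map="auto", quantization_config=quantization_config
)

for n in (512, 1024, 1536, 2048, 2560, 3072):
    input_ids = torch.full((1, n), tokenizer.eos_token_id, dtype=torch.long, device=model.device)
    try:
        with torch.no_grad():
            model(input_ids)
        print(n, "tokens: ok")
    except torch.cuda.OutOfMemoryError:
        print(n, "tokens: OOM")
        break
    finally:
        torch.cuda.empty_cache()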
If I set --max-batch-total-tokens 2048 and --max-batch-prefill-tokens 1024, I still see the same illegal memory access, even though I shouldn't be maxing out GPU memory if my investigations in Python tell me anything.
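Back-of-the-envelope, the KV cache for 2048 total tokens should be tiny anyway, assuming neural-chat keeps Mistral-7B's shape (32 layers, 8 KV heads, head dim 128, fp16 cache):

# K and V, per token: 2 * layers * kv_heads * head_dim * 2 bytes (fp16)
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(per_token / 1024, "KiB per token")           # 128.0 KiB per token
print(2048 * per_token / 2**20, "MiB for 2048")    # 256.0 MiB for 2048 tokens

so even next to the roughly 4 GB of NF4 weights there should be plenty of headroom, unless something else in warmup is grabbing memory.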
Model description
I'm using neural-chat-7b-v3-1 locally on my laptop and it would sure be sweet if I could serve it through tgi.
I can currently use it with python using the pattern
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
fpath = '/home/thomas/src/neural-chat-7b-v3-1'
tokenizer = AutoTokenizer.from_pretrained(fpath)
model = AutoModelForCausalLM.from_pretrained(
    fpath,
    device_map='auto',
    quantization_config=quantization_config,
)
but when I try to pass the path of the cloned repo to TGI, I get
(text-generation-inference) thomas@computer-1:~/src/notes/projects/assistant/backend/text-generation-inference$ target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
2023-11-24T15:56:31.756672Z INFO text_generation_launcher: Args { model_id: "/home/thomas/src/neural-chat-7b-v3-1", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-11-24T15:56:31.756747Z INFO download: text_generation_launcher: Starting download process.
2023-11-24T15:56:33.786649Z INFO text_generation_launcher: Peft model detected.
2023-11-24T15:56:33.786688Z INFO text_generation_launcher: Loading the model it might take a while without feedback
2023-11-24T15:56:34.161062Z ERROR download: text_generation_launcher: Download encountered an error: Traceback (most recent call last):
  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/utils/config.py", line 117, in from_pretrained
    config_file = hf_hub_download(
  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/thomas/src/neural-chat-7b-v3-1'. Use `repo_type` argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/src/notes/projects/assistant/backend/text-generation-inference/server/text_generation_server/utils/peft.py", line 16, in download_and_unload_peft
    model = AutoPeftModelForCausalLM.from_pretrained(
  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/auto.py", line 69, in from_pretrained
    peft_config = PeftConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/thomas/miniconda3/envs/text-generation-inference/lib/python3.9/site-packages/peft/utils/config.py", line 121, in from_pretrained
    raise ValueError(f"Can't find '{CONFIG_NAME}' at '{pretrained_model_name_or_path}'")
ValueError: Can't find 'adapter_config.json' at '/home/thomas/src/neural-chat-7b-v3-1'
So I'm seeing an error that appears to be related to #1283, in addition to TGI complaining there's no adapter_config.json, which is odd because the repo has the full model and is not a PEFT adapter. But it doesn't even look like it can see the local repo, so I don't know.
Open source status
- [x] The model implementation is available
- [x] The model weights are available
Provide useful links for the implementation
https://huggingface.co/Intel/neural-chat-7b-v3-1
I encountered this exact error while running it from Docker.
If you run it locally, you can comment out the offending code here.
Hi,
I run the command below and get an error:
text-generation-launcher --model-id ~/models/beluga --port=8080 --quantize bitsandbytes
RuntimeError: unable to mmap 9976578928 bytes from file </home/models/beluga/model-00001-of-00002.safetensors>: Cannot allocate memory (12)
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Any update on this? I get the same error with the Intel/neural-chat-7b-v3-3 model.