huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Llama 3/3.1 70B Outputting "!!!!!!"; Shorter Context #2312

Open mallorbc opened 1 month ago

mallorbc commented 1 month ago

System Info

text-generation-launcher --env

```
2024-07-26T03:39:42.960734Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: 3905f854ed49b0bc50e6c983d3e6b254fcf02288
Docker label: sha-3905f85
nvidia-smi:
Fri Jul 26 03:39:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:0E:00.0 Off |                  N/A |
| 49%   62C    P2            113W / 350W  |  22668MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0F:00.0 Off |                  N/A |
| 30%   53C    P2            102W / 350W  |  21924MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

xpu-smi: N/A
2024-07-26T03:39:42.960780Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "4e90e37e133c",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/root/.cache",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
```

I am using Ubuntu 22.04

I have two RTX 3090s.

I am using the latest docker images as of 7/25/24.

Output from pip list:

```
Package                                    Version
------------------------------------------ ---------------
accelerate                                 0.29.3
aiohttp                                    3.9.5
aiosignal                                  1.3.1
annotated-types                            0.7.0
archspec                                   0.2.3
async-timeout                              4.0.3
attrs                                      23.2.0
bitsandbytes                               0.43.2
boltons                                    24.0.0
Brotli                                     1.1.0
certifi                                    2024.7.4
cffi                                       1.16.0
charset-normalizer                         3.3.2
click                                      8.1.7
cloudpickle                                3.0.0
colorama                                   0.4.6
conda                                      24.5.0
conda-libmamba-solver                      24.1.0
conda-package-handling                     2.2.0
conda_package_streaming                    0.9.0
datasets                                   2.20.0
Deprecated                                 1.2.14
dill                                       0.3.8
diskcache                                  5.6.3
distro                                     1.9.0
einops                                     0.6.1
filelock                                   3.15.4
frozendict                                 2.4.4
frozenlist                                 1.4.1
fsspec                                     2024.5.0
gmpy2                                      2.1.5
googleapis-common-protos                   1.63.2
grpc-interceptor                           0.15.4
grpcio                                     1.65.1
grpcio-reflection                          1.62.2
grpcio-status                              1.62.2
grpcio-tools                               1.62.2
hf_transfer                                0.1.8
huggingface-hub                            0.23.5
idna                                       3.7
importlib_metadata                         7.1.0
interegular                                0.3.3
Jinja2                                     3.1.4
joblib                                     1.4.2
jsonpatch                                  1.33
jsonpointer                                2.4
jsonschema                                 4.23.0
jsonschema-specifications                  2023.12.1
lark                                       1.1.9
libmambapy                                 1.5.8
llvmlite                                   0.43.0
loguru                                     0.6.0
mamba                                      1.5.8
MarkupSafe                                 2.1.5
menuinst                                   2.0.2
mpmath                                     1.3.0
multidict                                  6.0.5
multiprocess                               0.70.16
mypy-protobuf                              3.6.0
nest-asyncio                               1.6.0
networkx                                   3.3
numba                                      0.60.0
numpy                                      1.26.4
nvidia-nccl-cu12                           2.22.3
opentelemetry-api                          1.25.0
opentelemetry-exporter-otlp                1.25.0
opentelemetry-exporter-otlp-proto-common   1.25.0
opentelemetry-exporter-otlp-proto-grpc     1.25.0
opentelemetry-exporter-otlp-proto-http     1.25.0
opentelemetry-instrumentation              0.46b0
opentelemetry-instrumentation-grpc         0.46b0
opentelemetry-proto                        1.25.0
opentelemetry-sdk                          1.25.0
opentelemetry-semantic-conventions         0.46b0
outlines                                   0.0.34
packaging                                  24.1
pandas                                     2.2.2
peft                                       0.10.0
pillow                                     10.4.0
pip                                        24.0
platformdirs                               4.2.0
pluggy                                     1.4.0
prometheus_client                          0.20.0
protobuf                                   4.25.3
psutil                                     6.0.0
py-cpuinfo                                 9.0.0
pyarrow                                    17.0.0
pyarrow-hotfix                             0.6
pycosat                                    0.6.6
pycparser                                  2.22
pydantic                                   2.8.2
pydantic_core                              2.20.1
PySocks                                    1.7.1
python-dateutil                            2.9.0.post0
pytz                                       2024.1
PyYAML                                     6.0.1
referencing                                0.35.1
regex                                      2024.5.15
requests                                   2.32.3
rpds-py                                    0.19.1
ruamel.yaml                                0.18.6
ruamel.yaml.clib                           0.2.8
safetensors                                0.4.3
scipy                                      1.13.1
sentencepiece                              0.1.99
setuptools                                 71.1.0
six                                        1.16.0
sympy                                      1.13.0
text-generation-server                     2.0.5.dev0
texttable                                  1.7.0
tokenizers                                 0.19.1
torch                                      2.4.0
tqdm                                       4.66.4
transformers                               4.43.1
triton                                     3.0.0
truststore                                 0.8.0
typer                                      0.6.1
types-protobuf                             5.27.0.20240626
typing_extensions                          4.12.2
tzdata                                     2024.1
urllib3                                    2.2.2
wheel                                      0.43.0
wrapt                                      1.16.0
xxhash                                     3.4.1
yarl                                       1.9.4
zipp                                       3.19.2
zstandard                                  0.22.0
```

Information

Tasks

Reproduction

Run TGI like the following:

```
text-generation-launcher \
    --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
    --huggingface-hub-cache /root/.cache/huggingface/hub \
    --trust-remote-code \
    --max-input-length 2047 \
    --max-total-tokens 2048 \
    --quantize bitsandbytes-nf4
```

Query the model and notice that, a high percentage of the time (but not always), the output is a string of "!!!!".
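For reference, a minimal Python sketch of the kind of query that triggers it, assuming the container's port 80 is mapped to localhost:8080 and using TGI's standard /generate endpoint (the prompt and sampling parameters here are arbitrary placeholders):

```python
# Send the same prompt repeatedly and count how often the response
# degenerates into runs of "!". Adjust the URL for your port mapping.
import requests

URL = "http://localhost:8080/generate"
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 64, "do_sample": True, "temperature": 0.7},
}

bad = 0
for _ in range(20):
    text = requests.post(URL, json=payload, timeout=120).json()["generated_text"]
    if text.lstrip().startswith("!!!!"):
        bad += 1
print(f"degenerate responses: {bad}/20")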

Also, I used to be able to run Llama 3 70B models with a 5k context on dual 3090s. Now I cannot run the Llama 2 70B models without errors most of the time, and they won't even fully load unless I drop the context window to something like 4k.
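For context on the memory side, here is a back-of-the-envelope KV-cache estimate, assuming the published Llama 3 70B configuration (80 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache:

```python
# Rough KV-cache footprint for Llama 3 70B, assuming 80 layers,
# 8 KV heads (GQA), head_dim 128, fp16. K and V each store
# kv_heads * head_dim values per layer per token.
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(f"{per_token / 2**20:.2f} MiB per cached token")   # ~0.62 MiB
print(f"{5000 * per_token / 2**30:.2f} GiB for 5k ctx")  # ~3.05 GiB
```

At roughly 0.6 MiB per token, a 5k cache is only about 3 GiB, so if newer versions fail to load at the same context length, the extra memory is presumably being reserved somewhere other than the KV cache itself.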

Expected behavior

I expect to be able to use the model as before, and the context window should expand or stay the same with updates, not shrink.

mallorbc commented 1 month ago

This just happened with the 8B model too. I am thinking it may have something to do with bitsandbytes, but I am not sure.

mallorbc commented 1 month ago

It is happening when not using quantization as well. Still pseudo-random.

mallorbc commented 1 month ago

Rebooting sometimes helps. Maybe it's a hardware issue.
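If hardware is a suspect, here is a minimal sanity-check sketch, assuming torch with CUDA is available: it runs the same fp32 matmul on each GPU and compares against a CPU reference, so NaNs or a large mismatch on one card would point at hardware rather than TGI. (Worth noting: "!" corresponds to token id 0 in the Llama 3 tokenizer, so NaN or otherwise broken logits often decode to exactly these runs of "!".)

```python
# Compare a deterministic matmul on each GPU against a CPU reference.
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # apples-to-apples fp32
torch.manual_seed(0)
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)
ref = a @ b  # CPU reference result

for i in range(torch.cuda.device_count()):
    dev = f"cuda:{i}"
    out = (a.to(dev) @ b.to(dev)).cpu()
    print(
        dev,
        "nan:", torch.isnan(out).any().item(),
        "max |diff| vs CPU:", (out - ref).abs().max().item(),
    )
```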

danieldk commented 1 month ago

Interesting. I haven't seen this issue with 8B on A10G or 405B on H100. Would be curious to know if it's indeed a hardware issue.

mallorbc commented 1 month ago

It didn't happen on previous versions, so if it is hardware-related, it's either a recent development or a bug that was introduced.

maziyarpanahi commented 1 month ago

I am having an issue with quality as well; the outputs are not as good as Llama-3-70B's.