huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

HuggingFaceH4/zephyr-7b-beta issue "Method prefill encountered error" #1566

Closed muhammad-asn closed 6 months ago

muhammad-asn commented 6 months ago

System Info

Text-generation-inference: v1.1.0
OS: Ubuntu 22.04.3
Nvidia driver: 545.23.08
CUDA version: 12.3
Chat UI: https://github.com/huggingface/chat-ui/ tag v.0.7
GPU: A100 80GB

Information

Tasks

Reproduction

  1. Deploy TGI with this config

    services:
      llm:
        image: ghcr.io/huggingface/text-generation-inference:1.1.0
        container_name: llm
        command: >
          --model-id HuggingFaceH4/zephyr-7b-beta
          --max-total-tokens 8192
          --max-input-length 4096
          --num-shard 1
          --max-top-n-tokens 1
          --max-best-of 1
          --disable-custom-kernels
          --trust-remote-code
          --max-stop-sequences 1
          --validation-workers 1
          --waiting-served-ratio 0
          --max-batch-total-tokens 8192
          --max-batch-prefill-tokens 4096
          --max-waiting-tokens 4096
          --cuda-memory-fraction 0.8
          --max-concurrent-requests 512
        volumes:
          - ./data:/data
        ports:
          - 8080:80
        shm_size: '1gb'
        restart: always
        environment:
          - CUDA_LAUNCH_BLOCKING=1
        deploy:
          resources:
            reservations:
              devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]
  2. Deploy Chat UI with this config (.env.local)

    MONGODB_URL=mongodb://mongodb:27017
    MONGODB_DB_NAME=chat-ui
    MONGODB_DIRECT_CONNECTION=false

    MODELS=[{
      "name": "HuggingFaceH4/zephyr-7b-beta",
      "displayName": "HuggingFaceH4/zephyr-7b-beta",
      "description": "Zephyr 7b Beta Model",
      "websiteUrl": "https://huggingface.co/HuggingFaceH4/zephyr-7b-beta",
      "preprompt": "",
      "parameters": { "details": false, "do_sample": true, "max_new_tokens": 4096, "best_of": 1, "repetition_penalty": 1.17, "return_full_text": false, "temperature": 0.01, "top_p": 0.14, "top_k": 49, "truncate": 4096, "typical_p": 0.99, "watermark": false, "decoder_input_details": false },
      "endpoints": [{ "type": "tgi", "url": "http://llm/generate_stream" }],
      "promptExamples": [
        { "title": "Write an email from bullet list", "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)" },
        { "title": "Code a snake game", "prompt": "Code a basic snake game in python, give explanations for each step." },
        { "title": "Assist in a task", "prompt": "How do I make a delicious lemon cheesecake?" }
      ]
    }]

    # THIS LINE BELOW IS MANDATORY FOR DEFAULT VALUE
    OLD_MODELS=[]# any removed models, { name: string, displayName?: string, id?: string }
    TASK_MODEL= # name of the model used for tasks such as summarizing title, creating query, etc.

    PUBLIC_ORIGIN=#https://huggingface.co
    PUBLIC_SHARE_PREFIX=#https://hf.co/chat
    PUBLIC_GOOGLE_ANALYTICS_ID=#G-XXXXXXXX / Leave empty to disable
    PUBLIC_ANNOUNCEMENT_BANNERS=[ { "title": "Llama v2 is live on HuggingChat! 🦙", "linkTitle": "Announcement", "linkHref": "https://huggingface.co/blog/llama2" } ]

    PARQUET_EXPORT_DATASET=
    PARQUET_EXPORT_HF_TOKEN=
    PARQUET_EXPORT_SECRET=

    RATE_LIMIT= # requests per minute
    MESSAGES_BEFORE_LOGIN=# how many messages a user can send in a conversation before having to login. set to 0 to force login right away

    PUBLIC_APP_NAME=ChatUI # name used as title throughout the app
    PUBLIC_APP_ASSETS=chatui # used to find logos & favicons in static/$PUBLIC_APP_ASSETS
    PUBLIC_APP_COLOR=blue # can be any of tailwind colors: https://tailwindcss.com/docs/customizing-colors#default-color-palette
    PUBLIC_APP_DESCRIPTION=# description used throughout the app (if not set, a default one will be used)
    PUBLIC_APP_DATA_SHARING=#set to 1 to enable options & text regarding data sharing
    PUBLIC_APP_DISCLAIMER=#set to 1 to show a disclaimer on login page
    LLM_SUMMERIZATION=true

    COOKIE_NAME=hf-chat
    HF_TOKEN=#hf token from https://huggingface.co/settings/token
    HF_API_ROOT=https://api-inference.huggingface.co/models
    OPENAI_API_KEY=#your openai api key here

    HF_ACCESS_TOKEN=#LEGACY! Use HF_TOKEN instead

    # used to activate search with web functionality. disabled if none are defined. choose one of the following:
    YDC_API_KEY=#your docs.you.com api key here
    SERPER_API_KEY=#your serper.dev api key here
    SERPAPI_KEY=#your serpapi key here
    SERPSTACK_API_KEY=#your serpstack api key here
    USE_LOCAL_WEBSEARCH=#set to true to parse google results yourself, overrides other API keys

    WEBSEARCH_ALLOWLIST=[] # if it's defined, allow websites from only this list.
    WEBSEARCH_BLOCKLIST=[] # if it's defined, block websites from this list.

    # Parameters to enable open id login
    OPENID_CONFIG={ "PROVIDER_URL": "", "CLIENT_ID": "", "CLIENT_SECRET": "", "SCOPES": "" }

    # /!\ legacy openid settings, prefer the config above
    OPENID_CLIENT_ID=
    OPENID_CLIENT_SECRET=
    OPENID_SCOPES="openid profile" # Add "email" for some providers like Google that do not provide preferred_username
    OPENID_PROVIDER_URL=https://huggingface.co # for Google, use https://accounts.google.com
    OPENID_TOLERANCE=
    OPENID_RESOURCE=

    # Parameters to enable a global mTLS context for client fetch requests
    USE_CLIENT_CERTIFICATE=false
    CERT_PATH=#
    KEY_PATH=#
    CA_PATH=#
    CLIENT_KEY_PASSWORD=#
    REJECT_UNAUTHORIZED=true


  3. Run multiple concurrent requests; after a while the error below appears (a rough Python sketch of this kind of load follows the logs)

Remove prefill token due to error

2024-02-16T06:37:34.615391Z ERROR health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

2024-02-16T06:37:52.693170Z ERROR text_generation_launcher: Method Prefill encountered an error.
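
The load above was generated through the Chat UI; as a rough stand-in, the sketch below sends concurrent requests directly to TGI's /generate endpoint, assuming the 8080:80 port mapping from the compose file in step 1 (prompt text, request count, and sampling parameters are illustrative, not the exact traffic that triggered the error):

    # Minimal sketch: send concurrent requests to TGI's /generate endpoint.
    # Assumes the 8080:80 mapping above; prompt, request count, and parameters are illustrative.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    TGI_URL = "http://localhost:8080/generate"
    PROMPT = "Summarize the following meeting notes in detail. " * 100  # deliberately long input

    def send_request(_: int) -> int:
        payload = {
            "inputs": PROMPT,
            "parameters": {
                "max_new_tokens": 512,
                "do_sample": True,
                "temperature": 0.01,
                "top_p": 0.14,
                "repetition_penalty": 1.17,
            },
        }
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        resp.raise_for_status()
        return len(resp.json()["generated_text"])

    if __name__ == "__main__":
        # 32 parallel requests; the original report used the Chat UI, so this only approximates the load.
        with ThreadPoolExecutor(max_workers=32) as pool:
            for n in pool.map(send_request, range(32)):
                print(f"received {n} characters")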



Expected behavior

TGI should run seamlessly and without issues.
muhammad-asn commented 6 months ago

Any ideas, @OlivierDehaene @Narsil? Or should I upgrade to a newer version of TGI?

muhammad-asn commented 6 months ago

Any update?

OlivierDehaene commented 6 months ago

If you could try with the latest version, 1.4.2, that would be great!

muhammad-asn commented 6 months ago

@OlivierDehaene can you point out which line causes the bug or issue?

OlivierDehaene commented 6 months ago
services:
  llm:
    image: ghcr.io/huggingface/text-generation-inference:1.4.2
    container_name: llm
    command: >
      --model-id HuggingFaceH4/zephyr-7b-beta 
      --max-total-tokens 8192 
      --max-input-length 4096 
      --num-shard 1 
      --max-top-n-tokens 1 
      --max-best-of 1 
      --disable-custom-kernels 
      --trust-remote-code  
      --max-stop-sequences 1  
      --validation-workers 1 
      --waiting-served-ratio 0 
      --max-batch-total-tokens 8192 
      --max-batch-prefill-tokens 4096 
      --max-waiting-tokens 4096 
      --cuda-memory-fraction 0.8
      --max-concurrent-requests 512
    volumes:
      - ./data:/data
    ports:
      - 8080:80
    shm_size: '1gb'
    restart: always
    environment:
      - CUDA_LAUNCH_BLOCKING=1
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
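
Once the upgraded container is up, a quick sanity check against the router is possible; a minimal sketch, assuming the same 8080:80 port mapping (the TGI router exposes /health and /info):

    # Sketch: confirm the upgraded TGI container is healthy and serving the expected model.
    # Assumes the 8080:80 mapping from the compose file above.
    import requests

    base = "http://localhost:8080"

    health = requests.get(f"{base}/health", timeout=10)
    print("health:", health.status_code)  # 200 once the model has loaded

    info = requests.get(f"{base}/info", timeout=10).json()
    print("model:", info.get("model_id"), "version:", info.get("version"))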
muhammad-asn commented 6 months ago

Sorry 😅 @OlivierDehaene, what I mean is: which feature or part of the code makes version 1.1.0 or 1.1.1 cause the issue while 1.4.2 does not?

muhammad-asn commented 6 months ago

v1.1.1: https://github.com/huggingface/text-generation-inference/blob/v1.1.1/server/text_generation_server/models/flash_mistral.py#L201-L203

v1.4.3: https://github.com/huggingface/text-generation-inference/blob/v1.1.1/server/text_generation_server/models/flash_mistral.py#L193-L203

The code is still the same between 1.1.1 and 1.4.3. Could you please explain?

The issue is in this part of the code, text_generation_server/models/flash_mistral.py, line 201:

    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 90, in Prefill
    batch = self.model.batch_type.from_pb(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_mistral.py", line 201, in from_pb
    all_input_ids_tensor = torch.tensor(
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
michaelact commented 6 months ago

Hi @OlivierDehaene, I can confirm that the issue still persists, even after upgrading from version 1.1.1 to 1.4.3.

OlivierDehaene commented 6 months ago

I'm unable to reproduce the issue on my side with either 1.1.1 or 1.4.3. Can you share an easily reproducible example?

michaelact commented 6 months ago

I'll share the details later, @OlivierDehaene . Can you assist me in compiling with TORCH_USE_CUDA_DSA to enable device-side assertions using a Dockerfile? Which line needs to be changed?

I tried to modify the python setup.py build line to TORCH_USE_CUDA_DSA=1 python setup.py build, but it still isn't enabled; the log message still appears.

muhammad-asn commented 6 months ago

@OlivierDehaene sorry for the late reply. After we reproduced the issue, we found it came from input prompts whose actual token count was very close to the default limit (4096). So we adjusted the flags to 2x (8192).
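
For reference, a minimal sketch of how a prompt can be checked against these limits before sending, assuming the transformers tokenizer for the same model (the prompt string is a placeholder):

    # Sketch: compare a prompt's token count against the configured TGI limits.
    # Assumes `transformers` is installed; the prompt string is a placeholder.
    from transformers import AutoTokenizer

    MAX_INPUT_LENGTH = 4096   # --max-input-length
    MAX_TOTAL_TOKENS = 8192   # --max-total-tokens (input + generated)
    MAX_NEW_TOKENS = 4096     # max_new_tokens in the Chat UI MODELS config

    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

    prompt = "..."  # the long chat history that triggered the error
    n_input = len(tokenizer(prompt)["input_ids"])

    print(f"input tokens: {n_input} / {MAX_INPUT_LENGTH}")
    # With max_new_tokens=4096, input + output can reach the 8192 total budget,
    # which is roughly where the failures started to appear.
    print(f"worst-case total: {n_input + MAX_NEW_TOKENS} / {MAX_TOTAL_TOKENS}")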

Thank you for your assistance