huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

TGI NeuronX DLC (Optimum-neuron) 0.0.20: SageMaker deployment failure with llama-2 7B #516

Open Neo9061 opened 6 months ago

Neo9061 commented 6 months ago

System Info

I precompiled my artifacts into the following layout, using a sequence length of 4K, a batch size of 4, and a tensor parallel degree of 2. The same code works with the previous DLC version, 0.0.17.

checkpoint/
compiled/
config.json
...
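
For reference, artifacts with this layout can typically be produced with the optimum-neuron export API; the model id and output directory below are placeholders, and the arguments mirror the settings described above (a sketch, not the exact script used here, and API details may vary by optimum-neuron version):

from optimum.neuron import NeuronModelForCausalLM

# Compile llama-2 7B for Neuron: batch size 4, sequence length 4K, tensor parallel degree 2.
# Model id and output path are placeholders.
neuron_model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=2,           # tensor parallel degree
    auto_cast_type="bf16",
)
neuron_model.save_pretrained("llama-2-7b-neuron/")  # writes checkpoint/, compiled/ and config.json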

During deployment to the SageMaker endpoint, we get the following error.

2024-03-14T20:30:37.453335Z ERROR text_generation_launcher: Method Prefill encountered an error.

2024-03-14T20:30:37.662Z
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 62, in serve
    serve(model_path, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in serve
    asyncio.run(serve_inner(model_path))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(


2024-03-14T20:30:37.662Z
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill
    generations, batch = self.generator.prefill(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill
    raise ValueError(

ValueError: Cannot prefill 1 new request(s) with only 0 empty slots.Please align the number of concurrent requests with the static batch size: 4.

Here are the ENV variables I specified during SageMaker endpoint deployment.

"ENDPOINT_SERVER_TIMEOUT": "3600",
"HF_MODEL_ID": "/opt/ml/model",
"MODEL_CACHE_ROOT": "/opt/ml/model",
"SAGEMAKER_ENV": "1",


### Who can help?

@dacorvo @philschmid 

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

# Imports and session objects assumed by this snippet (standard SageMaker SDK);
# the instance type matches the inf2.xlarge instance mentioned later in the thread.
import sagemaker
from sagemaker.model import Model
from sagemaker.utils import name_from_base

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
instance_type = "ml.inf2.xlarge"

model_data = {
    'S3DataSource': {
        'CompressionType': 'None',
        'S3DataType': 'S3Prefix',
        'S3Uri': <S3-url>,  # placeholder for the S3 prefix holding the precompiled artifacts
    }
}
endpoint_name = name_from_base("opt-neuron")
model = Model(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.20-neuronx-py310-ubuntu22.04",
    model_data=model_data,
    role=role,
    sagemaker_session=sagemaker_session,
    name=endpoint_name,
    env={
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        "HF_MODEL_ID": "/opt/ml/model",
        "MODEL_CACHE_ROOT": "/opt/ml/model",
        "SAGEMAKER_ENV": "1",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    volume_size=512,
    model_data_download_timeout=3600,
    container_startup_health_check_timeout=3600,
)

### Expected behavior

It should deploy successfully as a SageMaker endpoint.
dacorvo commented 6 months ago

You need to set MAX_BATCH_SIZE = batch_size to avoid this. I am very surprised it worked in 0.0.17 without adjusting MAX_BATCH_PREFILL_TOKENS and MAX_BATCH_TOTAL_TOKENS.
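
In the SageMaker snippet from the issue description, that would mean adding the variable to the env dict so the router's static batch size matches the compiled model; a minimal sketch, assuming the same values as in the original deployment:

env={
    "ENDPOINT_SERVER_TIMEOUT": "3600",
    "HF_MODEL_ID": "/opt/ml/model",
    "MODEL_CACHE_ROOT": "/opt/ml/model",
    "SAGEMAKER_ENV": "1",
    "MAX_BATCH_SIZE": "4",  # must match the batch_size the artifacts were compiled with
},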

dacorvo commented 6 months ago

See the documentation here: https://github.com/huggingface/optimum-neuron/blob/main/text-generation-inference/README.md#using-a-standard-model-from-the--huggingface-hub

Neo9061 commented 6 months ago

Hi @dacorvo, I retried with MAX_BATCH_SIZE using the following ENVs.

  env={
      "ENDPOINT_SERVER_TIMEOUT": "3600",
      "HF_MODEL_ID": "/opt/ml/model", 
      "MODEL_CACHE_ROOT": "/opt/ml/model",
      "SAGEMAKER_ENV": "1",
      "HF_NUM_CORES": "2",
      "HF_BATCH_SIZE": "4",
      "HF_SEQUENCE_LENGTH": "4096",
      "HF_AUTO_CAST_TYPE": "bf16",
  },

But it still failed with the following logs.


Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 62, in serve
    serve(model_path, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in serve
    asyncio.run(serve_inner(model_path))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill
    generations, batch = self.generator.prefill(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill
    raise ValueError(
ValueError: Cannot prefill 1 new request(s) with only 0 empty slots.Please align the number of concurrent requests with the static batch size: 4.
dacorvo commented 6 months ago

You did not set MAX_BATCH_SIZE, but only HF_BATCH_SIZE. The link I sent was not explanatory enough.

Neo9061 commented 6 months ago

My bad! I thought I needed to specify the batch size but didn't notice that there are two types of batch size. Retrying.

dacorvo commented 6 months ago

> My bad! I thought I needed to specify the batch size but didn't notice that there are two types of batch size. Retrying.

I had posted the wrong link: fixed it.

There are two types of env variables:
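
Judging from the variables used elsewhere in this thread and the linked README, the two groups are roughly the HF_* settings that describe how the model was compiled for Neuron and the standard TGI MAX_* settings consumed by the launcher/router. A hedged sketch (the grouping is inferred from this thread, not quoted from the docs):

env = {
    # Neuron export settings: must match how the checkpoint was compiled
    "HF_NUM_CORES": "2",
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "bf16",
    # TGI launcher/router settings: control the serving behaviour
    "MAX_BATCH_SIZE": "4",        # should equal the compiled batch size
    "MAX_TOTAL_TOKENS": "4096",   # should not exceed the compiled sequence length
    "MAX_INPUT_LENGTH": "512",
}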

Neo9061 commented 6 months ago

I specified the following ENVs.

  env={
      "ENDPOINT_SERVER_TIMEOUT": "3600",
      "HF_MODEL_ID": "/opt/ml/model", 
      "MODEL_CACHE_ROOT": "/opt/ml/model",
      "SAGEMAKER_ENV": "1",
      "HF_NUM_CORES": "2",
      "HF_BATCH_SIZE": "4",
      "MAX_BATCH_SIZE": "4",
      "HF_SEQUENCE_LENGTH": "4096",
      "HF_AUTO_CAST_TYPE": "bf16",
      "MAX_TOTAL_TOKENS": "4096",
      "MAX_INPUT_LENGTH": "512",

Correspondingly, the CloudWatch logs show the following, which is a good indication that the ENVs are passed through. (Except for max_batch_size: Some(4): what is Some?)

2024-03-15T14:08:48.239012Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }

It still says:

Method Prefill encountered an error.
2024-03-15T14:10:49.762Z Traceback (most recent call last): File "/usr/local/bin/text-generation-server", line 8, in <module> sys.exit(app()) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main return _main( File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 62, in serve serve(model_path, uds_path) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in serve asyncio.run(serve_inner(model_path)) File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method return await self.intercept(
2024-03-15T14:10:49.762Z > File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept return await response File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill generations, batch = self.generator.prefill(request.batch) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill raise ValueError(
2024-03-15T14:10:49.762Z ValueError: Cannot prefill 1 new request(s) with only 0 empty slots.Please align the number of concurrent requests with the static batch size: 4.
dacorvo commented 6 months ago

It is very difficult to investigate without the full logs. What configuration was detected for the neuron model? You should have one log saying what the model batch_size is in the neuron config.

Neo9061 commented 6 months ago

Here is the full log, David. The config file of my compiled llama is the following.

{
  "_name_or_path": "llama-2-7b/config.json",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "neuron": {
    "auto_cast_type": "fp16",
    "batch_size": 4,
    "checkpoint_id": null,
    "checkpoint_revision": null,
    "compiler_type": "neuronx-cc",
    "compiler_version": "2.12.68.0+4480452af",
    "num_cores": 2,
    "sequence_length": 2048,
    "task": "text-generation"
  },
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "use_cache": true,
  "vocab_size": 32000
}
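
Note that the "neuron" section above (batch_size 4, sequence_length 2048, fp16) is presumably what the server has to honor at runtime. A quick, hypothetical sanity check to compare the compiled settings against the intended serving ENVs before deploying (path and env values are placeholders):

import json

# Placeholder path to the directory holding the compiled artifacts and config.json.
with open("llama-2-7b-neuron/config.json") as f:
    neuron_cfg = json.load(f)["neuron"]

env = {"MAX_BATCH_SIZE": "4", "MAX_TOTAL_TOKENS": "2048"}  # intended serving settings

# The router's static batch size must equal the compiled batch size,
# and the total token budget cannot exceed the compiled sequence length.
assert int(env["MAX_BATCH_SIZE"]) == neuron_cfg["batch_size"]
assert int(env["MAX_TOTAL_TOKENS"]) <= neuron_cfg["sequence_length"]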
2024-03-18T14:20:45.480377Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-18T14:20:45.480462Z  INFO download: text_generation_launcher: Starting download process.
2024-03-18T14:20:45.726773Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-03-18T14:20:47.484110Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-18T14:20:47.484395Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-18T14:20:48.971274Z  INFO text_generation_launcher: Loading model on neuron devices (this can take a few minutes).
2024-03-18T14:20:57.493731Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:07.502784Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:17.513233Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:27.522705Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:37.531997Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:47.541664Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:21:57.550098Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:22:07.559149Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-18T14:22:09.194489Z  INFO text_generation_launcher: Model successfully loaded in 80.22 s.
2024-03-18T14:22:09.269178Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-03-18T14:22:09.360615Z  INFO shard-manager: text_generation_launcher: Shard ready in 81.874778525s rank=0
2024-03-18T14:22:09.448943Z  INFO text_generation_launcher: Starting Webserver
2024-03-18T14:22:09.508087Z  INFO text_generation_router: router/src/main.rs:237: Using local tokenizer config
2024-03-18T14:22:09.508174Z  WARN text_generation_router: router/src/main.rs:272: no pipeline tag found for model /opt/ml/model
2024-03-18T14:22:09.510446Z  INFO text_generation_router: router/src/main.rs:291: Warming up model
2024-03-18T14:22:11.578534Z  INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 8192
2024-03-18T14:22:11.578549Z  INFO text_generation_router: router/src/main.rs:329: Connected
2024-03-18T14:22:11.578553Z  WARN text_generation_router: router/src/main.rs:334: Invalid hostname, defaulting to 0.0.0.0
2024-03-18T14:22:12.646846Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last): File "/usr/local/bin/text-generation-server", line 8, in <module> sys.exit(app()) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main return _main( File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 62, in serve serve(model_path, uds_path) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in serve asyncio.run(serve_inner(model_path)) File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept return await response File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill generations, batch = self.generator.prefill(request.batch) File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill raise ValueError(
ValueError: Cannot prefill 1 new request(s) with only 0 empty slots.Please align the number of concurrent requests with the static batch size: 4.
Neo9061 commented 6 months ago

Here are the ENVs for my SageMaker deployment on inf2.xlarge.

    env={
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        "HF_MODEL_ID": "/opt/ml/model",
        "MODEL_CACHE_ROOT": "/opt/ml/model",
        "SAGEMAKER_ENV": "1",
        "HF_NUM_CORES": "2",
        "HF_BATCH_SIZE": "4",
        "MAX_BATCH_SIZE": "4",
        "HF_SEQUENCE_LENGTH": "2048",
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_TOTAL_TOKENS": "2048",
        "MAX_INPUT_LENGTH": "512",
    },
dacorvo commented 6 months ago

I am reproducing the error during warmup when there is a big prefill request: try also setting "MAX_BATCH_PREFILL_TOKENS" to something below batch_size * sequence_length (like batch_size * sequence_length // 2).
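
With the compiled settings from the config above (batch_size 4, sequence_length 2048), that suggestion would translate to roughly the following (a sketch; the exact value is a tuning choice, and the next comment ends up using 512 instead):

# Values taken from the compiled model's neuron config above.
batch_size = 4
sequence_length = 2048

# Per the suggestion: keep MAX_BATCH_PREFILL_TOKENS below batch_size * sequence_length.
max_batch_prefill_tokens = batch_size * sequence_length // 2  # 4096

env["MAX_BATCH_PREFILL_TOKENS"] = str(max_batch_prefill_tokens)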

Neo9061 commented 6 months ago

Thanks David. The endpoint is deployed successfully with MAX_BATCH_PREFILL_TOKENS set to 512. However, it failed with the following error when querying the short payload shown below. Please let me know if you need any other logging info from my end.

{'inputs': 'I believe the meaning of life is',
 'parameters': {'max_new_tokens': 64,
  'top_p': 0.9,
  'temperature': 0.6,
  'decoder_input_details': True,
  'details': True}}
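
For reference, this payload is typically sent to the deployed endpoint along these lines, assuming the predictor returned by model.deploy() in the reproduction snippet (serializer classes are the standard SageMaker SDK ones):

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send the payload above as JSON to the TGI container behind the endpoint.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

response = predictor.predict({
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
        "decoder_input_details": True,
        "details": True,
    },
})
print(response)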
thread 'tokio-runtime-worker' panicked at router/src/infer.rs:598:14:
ID not found in entries. This is a bug.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-03-18T18:17:46.703406Z ERROR text_generation_launcher: Webserver Crashed
2024-03-18T18:17:46.703430Z  INFO text_generation_launcher: Shutting down shards
2024-03-18T18:17:47.020107Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed
2024-03-18T18:17:47.192339Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-18T18:17:47.192435Z  INFO download: text_generation_launcher: Starting download process.
2024-03-18T18:17:47.301117Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-03-18T18:17:49.495696Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-18T18:17:49.495937Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-18T18:17:51.044167Z  INFO text_generation_launcher: Loading model on neuron devices (this can take a few minutes).
dacorvo commented 6 months ago

The issue arises during the server warmup, which attempts to reach the maximum capacity. For some reason, it looks like the server ignores the MAX_BATCH_SIZE variable and evaluates the number of requests needed to reach maximum capacity by dividing an inferred quantity by MAX_INPUT_TOKENS. It seems that with nothing else but MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS specified, that quantity is the maximum number of tokens for the model (4096). I will need to sort that out, but in the meantime I suggest you always set MAX_TOTAL_TOKENS = 4096.
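
If that reading is right, the arithmetic behind the warmup failure would be roughly the following (a back-of-the-envelope sketch, not the actual router code):

# Hypothetical reconstruction of the warmup sizing described above.
model_max_tokens = 4096   # quantity inferred by the server (maximum tokens for the model)
max_input_length = 512    # MAX_INPUT_LENGTH
static_batch_size = 4     # slots in the compiled model

warmup_requests = model_max_tokens // max_input_length  # 8 requests attempted at once
assert warmup_requests > static_batch_size  # more requests than empty slots -> prefill error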

dacorvo commented 6 months ago

The deployment is actually successful, but the warmup phase doesn't end cleanly, leaving pending requests in the decoding queue that the router is not aware of. These requests will be purged on the very first failing incoming request (only the first client will get the error you mentioned). This is fixed by #522