Neo9061 opened this issue 6 months ago (Open)
You need to set MAX_BATCH_SIZE = batch_size to avoid this. I am very surprised it worked in 0.0.17 without adjusting MAX_BATCH_PREFILL_TOKENS and MAX_BATCH_TOTAL_TOKENS.
Hi @dacorvo, I retried with MAX_BATCH_SIZE using the following ENVs.
env={
"ENDPOINT_SERVER_TIMEOUT": "3600",
"HF_MODEL_ID": "/opt/ml/model",
"MODEL_CACHE_ROOT": "/opt/ml/model",
"SAGEMAKER_ENV": "1",
"HF_NUM_CORES": "2",
"HF_BATCH_SIZE": "4",
"HF_SEQUENCE_LENGTH": "4096",
"HF_AUTO_CAST_TYPE": "bf16",
},
But it still failed with the following logs.
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 62, in serve
    serve(model_path, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in serve
    asyncio.run(serve_inner(model_path))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill
    generations, batch = self.generator.prefill(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill
    raise ValueError(
ValueError: Cannot prefill 1 new request(s) with only 0 empty slots. Please align the number of concurrent requests with the static batch size: 4.
You did not set MAX_BATCH_SIZE, but only HF_BATCH_SIZE. The link I sent was not explanatory enough.
MY BAD! I thought I needed to specify the batch size, but I didn't notice that there are two types of batch size. Retrying.
I had posted the wrong link: fixed it.
There are two types of env variables:
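A sketch of the two groups as I understand them from this thread (my grouping, not an official list; values taken from the configs posted below):

```python
env = {
    # HF_* variables describe the *compiled* neuron artifact
    "HF_NUM_CORES": "2",
    "HF_BATCH_SIZE": "4",          # static batch size baked in at compilation
    "HF_SEQUENCE_LENGTH": "2048",
    "HF_AUTO_CAST_TYPE": "bf16",
    # MAX_* variables configure the TGI server/router at serving time,
    # and must stay consistent with the compiled artifact
    "MAX_BATCH_SIZE": "4",         # must match the compiled static batch size
    "MAX_TOTAL_TOKENS": "2048",
    "MAX_INPUT_LENGTH": "512",
}

# The serving-time batch size has to agree with the compiled one,
# otherwise prefill fails with "0 empty slots".
assert env["MAX_BATCH_SIZE"] == env["HF_BATCH_SIZE"]
```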
I specified the following ENVs.
env={
"ENDPOINT_SERVER_TIMEOUT": "3600",
"HF_MODEL_ID": "/opt/ml/model",
"MODEL_CACHE_ROOT": "/opt/ml/model",
"SAGEMAKER_ENV": "1",
"HF_NUM_CORES": "2",
"HF_BATCH_SIZE": "4",
"MAX_BATCH_SIZE": "4",
"HF_SEQUENCE_LENGTH": "4096",
"HF_AUTO_CAST_TYPE": "bf16",
"MAX_TOTAL_TOKENS": "4096",
"MAX_INPUT_LENGTH": "512",
},
Correspondingly, the CloudWatch logs show the following, which is a good indicator that the ENVs are passed through. (Except for max_batch_size: Some(4)? What is Some?)
2024-03-15T14:08:48.239012Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
It still says:
Method Prefill encountered an error.
2024-03-15T14:10:49.762Z Traceback (most recent call last):
  ... (same stack as above, through grpc_interceptor/server.py, line 159, in invoke_intercept_method) ...
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill
    generations, batch = self.generator.prefill(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill
    raise ValueError(
ValueError: Cannot prefill 1 new request(s) with only 0 empty slots. Please align the number of concurrent requests with the static batch size: 4.
It is very difficult to investigate without the full logs. What configuration was detected for the neuron model? You should have a log saying what the model batch_size is in the neuron config.
Here are the full logs, David. The config file of my compiled llama is the following.
{
"_name_or_path": "llama-2-7b/config.json",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"neuron": {
"auto_cast_type": "fp16",
"batch_size": 4,
"checkpoint_id": null,
"checkpoint_revision": null,
"compiler_type": "neuronx-cc",
"compiler_version": "2.12.68.0+4480452af",
"num_cores": 2,
"sequence_length": 2048,
"task": "text-generation"
},
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.36.2",
"use_cache": true,
"vocab_size": 32000
}
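The compiled config above (neuron.batch_size = 4, neuron.sequence_length = 2048) can be checked against the serving ENVs before deploying. This is a hypothetical helper of my own, not part of TGI or optimum-neuron; note that in this thread MAX_TOTAL_TOKENS was 4096 while the compiled sequence_length was 2048, which a check like this would flag.

```python
import json

def check_neuron_env(config_path, env):
    """Flag serving ENVs that conflict with the compiled neuron config."""
    with open(config_path) as f:
        neuron = json.load(f)["neuron"]
    problems = []
    if int(env.get("MAX_BATCH_SIZE", "0")) != neuron["batch_size"]:
        problems.append(
            f"MAX_BATCH_SIZE should equal the compiled batch_size ({neuron['batch_size']})"
        )
    if int(env.get("MAX_TOTAL_TOKENS", "0")) > neuron["sequence_length"]:
        problems.append(
            f"MAX_TOTAL_TOKENS exceeds the compiled sequence_length ({neuron['sequence_length']})"
        )
    return problems
```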
2024-03-18T14:20:45.480377Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-18T14:20:45.480462Z  INFO download: text_generation_launcher: Starting download process.
2024-03-18T14:20:45.726773Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-03-18T14:20:47.484110Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-18T14:20:47.484395Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-18T14:20:48.971274Z  INFO text_generation_launcher: Loading model on neuron devices (this can take a few minutes).
2024-03-18T14:20:57.493731Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
(the "Waiting for shard to be ready" line repeats every 10 s until 14:22:07)
2024-03-18T14:22:09.194489Z  INFO text_generation_launcher: Model successfully loaded in 80.22 s.
2024-03-18T14:22:09.269178Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-03-18T14:22:09.360615Z  INFO shard-manager: text_generation_launcher: Shard ready in 81.874778525s rank=0
2024-03-18T14:22:09.448943Z  INFO text_generation_launcher: Starting Webserver
2024-03-18T14:22:09.508087Z  INFO text_generation_router: router/src/main.rs:237: Using local tokenizer config
2024-03-18T14:22:09.508174Z  WARN text_generation_router: router/src/main.rs:272: no pipeline tag found for model /opt/ml/model
2024-03-18T14:22:09.510446Z  INFO text_generation_router: router/src/main.rs:291: Warming up model
2024-03-18T14:22:11.578534Z  INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 8192
2024-03-18T14:22:11.578549Z  INFO text_generation_router: router/src/main.rs:329: Connected
2024-03-18T14:22:11.578553Z  WARN text_generation_router: router/src/main.rs:334: Invalid hostname, defaulting to 0.0.0.0
2024-03-18T14:22:12.646846Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  ... (same stack as above, through grpc_interceptor/server.py, line 159, in invoke_intercept_method) ...
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 43, in Prefill
    generations, batch = self.generator.prefill(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 348, in prefill
    raise ValueError(
ValueError: Cannot prefill 1 new request(s) with only 0 empty slots. Please align the number of concurrent requests with the static batch size: 4.
Here are the ENVs for my SageMaker deployment on inf2.xlarge.
env={
"ENDPOINT_SERVER_TIMEOUT": "3600",
"HF_MODEL_ID": "/opt/ml/model",
"MODEL_CACHE_ROOT": "/opt/ml/model",
"SAGEMAKER_ENV": "1",
"HF_NUM_CORES": "2",
"HF_BATCH_SIZE": "4",
"MAX_BATCH_SIZE": "4",
"HF_SEQUENCE_LENGTH": "2048",
"HF_AUTO_CAST_TYPE": "bf16",
"MAX_TOTAL_TOKENS": "2048",
"MAX_INPUT_LENGTH": "512",
},
I am reproducing the error during warmup when there is a big prefill request: try also setting "MAX_BATCH_PREFILL_TOKENS" to something below batch_size * sequence_length (like batch_size * sequence_length // 2).
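In numbers, using the compiled config from this thread (batch_size 4, sequence_length 2048), the suggestion works out to:

```python
batch_size = 4           # compiled static batch size
sequence_length = 2048   # compiled sequence length

upper_bound = batch_size * sequence_length        # 8192 tokens
max_batch_prefill_tokens = upper_bound // 2       # 4096, the suggested value

# the suggested value stays safely below the hard upper bound
assert max_batch_prefill_tokens < upper_bound
```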
Thanks David. The endpoint is deployed successfully with MAX_BATCH_PREFILL_TOKENS set to 512. However, it failed with the following error when querying with the short payload shown below. Please let me know if you need any other logging info from my end.
{'inputs': 'I believe the meaning of life is',
'parameters': {'max_new_tokens': 64,
'top_p': 0.9,
'temperature': 0.6,
'decoder_input_details': True,
'details': True}}
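For reference, a sketch of how such a payload would be sent to the endpoint (the endpoint name and boto3 client setup are assumptions of mine, not taken from this thread):

```python
import json

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
        "decoder_input_details": True,
        "details": True,
    },
}
body = json.dumps(payload).encode("utf-8")

# The actual call (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="my-tgi-endpoint",  # hypothetical name
#     ContentType="application/json",
#     Body=body,
# )
```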
thread 'tokio-runtime-worker' panicked at router/src/infer.rs:598:14:
ID not found in entries. This is a bug.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-03-18T18:17:46.703406Z ERROR text_generation_launcher: Webserver Crashed
2024-03-18T18:17:46.703430Z  INFO text_generation_launcher: Shutting down shards
2024-03-18T18:17:47.020107Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed
2024-03-18T18:17:47.192339Z  INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 512, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 512, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), enable_cuda_graphs: false, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/tmp"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-18T18:17:47.192435Z  INFO download: text_generation_launcher: Starting download process.
2024-03-18T18:17:47.301117Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-03-18T18:17:49.495696Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-18T18:17:49.495937Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-18T18:17:51.044167Z  INFO text_generation_launcher: Loading model on neuron devices (this can take a few minutes).
The issue arises during the server warmup, which attempts to reach maximum capacity. For some reason, it looks like the server ignores the MAX_BATCH_SIZE variable and evaluates the number of requests needed to reach maximum capacity by dividing an inferred quantity by MAX_INPUT_TOKENS. It seems that with nothing but MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS specified, that quantity is the maximum number of tokens for the model (4096). I will need to sort that out, but in the meantime I suggest you always set MAX_TOTAL_TOKENS = 4096.
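My reading of the behavior described above, as arithmetic (an assumption about the warmup logic based on this description, not confirmed against the code):

```python
# Assumption: during warmup the router sizes the number of test requests as
# (model maximum tokens) // MAX_INPUT_LENGTH, ignoring MAX_BATCH_SIZE.
model_max_tokens = 4096   # inferred model maximum mentioned above
max_input_length = 512
static_batch_size = 4     # compiled neuron batch size

warmup_requests = model_max_tokens // max_input_length  # 8
# 8 warmup requests against only 4 static slots would explain the
# "Cannot prefill ... 0 empty slots" error above.
assert warmup_requests > static_batch_size
```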
The deployment is actually successful, but the warmup phase doesn't end cleanly, leaving pending requests in the decoding queue that the router is not aware of. These requests will be purged on the very first failing incoming request (only the first client will get the error you mentioned). This is fixed by #522
System Info
I precompiled my artifacts to have the following format, with a sequence length of 4K, a batch size of 4, and a tensor parallel degree of 2. The same set of code works with the previous version of the DLC, 0.0.17.
During deployment in a SageMaker endpoint, we get the following error.
Here are the ENV variables I specified during SageMaker endpoint deployment.