Fails hard on large requests

System Info

We are using the streaming v1 chat completions API. After some number of requests, or after a single request with a large enough context, the LoRAX server stops responding, and all subsequent requests fail as well.

infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered

We are running it in Docker with one GPU (A100 PCIe) on runpod.io:

lorax-launcher --model-id microsoft/phi-2 --adapter-source s3  --compile --dtype bfloat16  --port 3000 --revision ef382358ec9e382308935a992d908de099b64c23 --max-input-length 2000 --max-total-tokens 2048 --env
2024-06-22T01:38:49.630259Z  INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.74.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Jun 22 01:38:49 2024
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA A100 80GB PCIe          On  | 00000000:E1:00.0 Off |                    0 |
   | N/A   34C    P0              61W / 300W |  71234MiB / 81920MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   +---------------------------------------------------------------------------------------+
2024-06-22T01:38:49.630346Z  INFO lorax_launcher: Args { model_id: "microsoft/phi-2", adapter_id: None, source: "hub", adapter_source: "s3", revision: Some("ef382358ec9e382308935a992d908de099b64c23"), validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: true, dtype: Some(BFloat16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "960a5e26c0d7", port: 3000, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false }

Full request log:

2024-06-22T01:12:03.526879786Z 2024-06-22T01:12:03.526727Z ERROR HTTP request{
otel.name=POST /v1/chat/completions
http.flavor=1.1
http.method=POST
http.route=/v1/chat/completions
http.scheme=HTTP
http.target=/v1/chat/completions
http.user_agent=Ktor client
otel.kind=server
trace_id=e68f52322fc88977fb39f91db1970199
http.status_code=200 otel.status_code="OK"
}:chat_completions_v1{default_return_full_text=Extension(false) info=Extension(Info {

model_id: "microsoft/phi-2",
model_sha: Some("ef382358ec9e382308935a992d908de099b64c23"),
model_dtype: "torch.bfloat16",
model_device_type: "cuda",
model_pipeline_tag: Some("text-generation"),
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 2000,
max_total_tokens: 2048,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 188144,
max_waiting_tokens: 20,
validation_workers: 2,
version: "0.1.0",
sha: None,
docker_label: None,
request_logger_url: None }

) request_logger_sender=Extension(Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x55bd96f750e0,
tail_position: 0 },
semaphore: Semaphore { semaphore: Semaphore { permits: 32 },
bound: 32 }, rx_waker: AtomicWaker, tx_count: 1, rx_fields: "..." } } })

req_headers={
"user-agent": "Ktor client",
"content-length": "5341",
"accept": "text/event-stream,application/json",
"accept-charset": "UTF-8",
"accept-encoding": "gzip, br",
"authorization": "Bearer",
"cache-control": "no-cache",
"cdn-loop": "cloudflare",
"cf-ipcountry": "US",
"cf-ray": "89785e2c4ce3ce40-SJC",
"cf-visitor": "{\"scheme\":\"https\"}",
"content-type": "application/json", "x-forwarded-for": "", "x-forwarded-host": "some.proxy.runpod.net",
"x-forwarded-proto": "https"}}:async_stream:generate_stream{

request=GenerateRequest { inputs: "[{\"content\":\"You are a helpful assistant. Write your answers using markdown markup.\",\"role\":\"system\"},{\"content\":\"write a quick sort in kotlin\",\"role\":\"user\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a quick sort implementation in Kotlin:\\n```kotlin\\nfun quickSort(array: Array<Int>): Array<Int> {\\n    if (array.size <= 1) {\\n        return array\\n    }\\n    var pivot = array[array.size / 2]\\n    var left = Array<Int>()\\n    var right = Array<Int>()\\n    for (i in 0 until array.size) {\\n        if (array[i] < pivot) {\\n            left += array[i]\\n        } else {\\n            right += array[i]\\n        }\\n    }\\n    return quickSort(left) + Array(pivot) + quickSort(right)\\n}\\n```\\nThis implementation uses the divide-and-conquer approach to sort an array. It selects a pivot element and partitions the array into two sub-arrays, one containing elements smaller than the pivot and the other containing elements greater than the pivot. It then recursively sorts the sub-arrays and combines them with the pivot element to obtain the sorted array.\\n\"},{\"content\":\"write an SQL query to select all users that were active this year\",\"role\":\"user\"},{\"content\":\"Sure, here's an SQL query that selects all users that were active this year:\\n```sql\\nSELECT * FROM users\\nWHERE active_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)\\nAND active_date <= CURDATE();\\n```\\nThis query selects all rows from the `users` table where the `active_date` column is greater than or equal to one year ago from the current date and less than or equal to the current date. 
This will return all users that were active this year.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a pocker simulator in kotlin\"},{\"content\":\"Sure, here's a simple pocker simulator in Kotlin:\\n```kotlin\\nfun main(args: Array<String>) {\\n    val pocker = Pocker()\\n    val pocker.start()\\n    val pocker.stop()\\n    println(\\\"Pocker stopped.\\\")\\n}\\n```\\nThis implementation creates a new instance of the `Pocker` class and starts the pocker process. It then stops the pocker process and prints a message indicating that the pocker has stopped.\\n\",\"role\":\"assistant\"},{\"role\":\"user\",\"content\":\"write a red black tree algorithm\"},{\"role\":\"assistant\",\"content\":\"Sure, here's a simple implementation of a red-black tree algorithm in Kotlin:\\n```kotlin\\nclass Node<T> {\\n    private var key: Int\\n    private var value: T\\n    private var left: Node<T>?\\n    private var right: Node<T>?\\n    private var color: String\\n\\n    override fun toString(): String {\\n        return \\\"$key: $value\\\"\\n    }\\n}\\n\\nclass RedBlackTree<T> {\\n    private var root: Node<T>?\\n\\n    fun insert(key: Int, value: T): Boolean {\\n        val node = root\\n        while (node!= null) {\\n            if (key < node.key) {\\n                node = node.left\\n            } else if (key > node.key) {\\n                node = node.right\\n            } else {\\n                return false\\n            }\\n        }\\n        val newNode = Node(key, value)\\n        if (root == null) {\\n            root = newNode\\n        } else {\\n            newNode.left = root\\n            newNode.right = root\\n            root = newNode\\n        }\\n        return true\\n    }\\n\\n    fun delete(key: Int): Boolean {\\n        val node = root\\n        while (node!= null) {\\n            if (key < node.key) {\\n                node = node.left\\n            } else if (key > node.key) {\\n                node = node.right\\n            
} else {\\n                if (node.left == null && node.right == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.left.color = \\\"red\\\"\\n                    }\\n                    root = null\\n                } else if (node.left == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.right.color = \\\"red\\\"\\n                    }\\n                    node = node.right\\n                } else if (node.right == null) {\\n                    if (node.color == \\\"red\\\") {\\n                        node.color = \\\"black\\\"\\n                        node.left.color = \\\"red\\\"\\n                    }\\n                    node = node.left\\n                } else {\\n                    val successor = findSuccessor(node.right)\\n                    val temp = successor.key\\n                    successor.key = node.key\\n                    node.key = temp\\n                    delete(temp)\\n                }\\n            }\\n        }\\n        return true\\n    }\\n\\n    private fun findSuccessor(node: Node<T>): Node<T> {\\n        val current = node\\n        while (current.left!= null) {\\n            current = current.left\\n        }\\n        return current\\n    }\\n}\\n```\\nThis implementation defines a `Node` class to represent each node in the red-black tree, and a `RedBlackTree` class to represent the tree itself. The `insert` method inserts a new node into the tree, while the `delete` method deletes a node from the tree. The `findSuccessor` method finds the successor of a given node in the tree.\\n\"},{\"content\":\"write self balancing tree algorithm\",\"role\":\"user\"}]",
parameters: GenerateParameters {
adapter_id: Some("s3://mybucket/model-1253878534445035520/"),
adapter_source: None,
adapter_parameters: None,
api_token: None,
best_of: None,
temperature: Some(1e-7),
repetition_penalty: None,
top_k: None,
top_p: None,
typical_p: None,
do_sample: false,
max_new_tokens: None,
ignore_eos_token: false,
return_full_text: Some(false),
stop: ["<|im_end|>","<|im_end|>"],
truncate: None,
watermark: false,
details: true,
decoder_input_details: false,
return_k_alternatives: None,
apply_chat_template: true,
seed: None,
response_format: None } }}:infer:send_error: lorax_router::infer: router/src/infer.rs:665: Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: CUDA error: device-side assert triggered
}

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

1. Run LoRAX with the launch command shown under System Info.
2. Send streaming chat completion requests with a long context.
3. At some point the response stream hangs.
4. All subsequent requests fail.
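
The steps above could be scripted roughly as follows. This is a minimal sketch, not the client we actually use (that one is a Ktor client). It assumes the server is reachable at `http://localhost:3000`, that the adapter is selected through the OpenAI-style `model` field, and that replaying a growing multi-turn history is enough to approach the input limit. The helper names `build_payload`, `long_conversation`, `stream_chat`, and `reproduce` are made up for illustration.

```python
import json
import urllib.request

# Assumption: the server from the launch command is reachable at this address.
BASE_URL = "http://localhost:3000"
# Adapter location as it appears in the request log of this report.
ADAPTER_ID = "s3://mybucket/model-1253878534445035520/"


def build_payload(messages, adapter_id=ADAPTER_ID):
    """Build an OpenAI-style streaming chat completion request body."""
    return {
        "model": adapter_id,  # assumption: the adapter is chosen via "model"
        "messages": messages,
        "stream": True,
        "temperature": 1e-7,
        "stop": ["<|im_end|>"],
    }


def long_conversation(turns, filler="write a red black tree algorithm in kotlin "):
    """Create a multi-turn history whose size grows with the number of turns."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for i in range(turns):
        messages.append({"role": "user", "content": filler * (i + 1)})
        messages.append({"role": "assistant", "content": filler * (i + 1)})
    messages.append({"role": "user", "content": "write self balancing tree algorithm"})
    return messages


def stream_chat(payload, base_url=BASE_URL, timeout=60):
    """POST the payload and yield the data part of each server-sent event line."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw in response:
            line = raw.decode("utf-8").strip()
            if line.startswith("data:"):
                yield line[len("data:"):].strip()


def reproduce(max_turns=20):
    """Send progressively longer conversations and print how many chunks arrive."""
    for turns in range(1, max_turns + 1):
        payload = build_payload(long_conversation(turns))
        chunks = list(stream_chat(payload))
        print(turns, len(json.dumps(payload)), len(chunks))
```

Calling `reproduce()` against a running server should eventually hit a timeout or an error response. Once that happens, a short request built with `long_conversation(0)` can be used to check whether the server has recovered.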

Expected behavior

If one request fails, subsequent requests should not fail.

Stacktrace:

2024-06-22T01:35:24.940390508Z 2024-06-22T01:35:24.940129Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
2024-06-22T01:35:24.940438547Z Traceback (most recent call last):
2024-06-22T01:35:24.940442357Z   File "/opt/conda/bin/lorax-server", line 8, in <module>
2024-06-22T01:35:24.940444837Z     sys.exit(app())
2024-06-22T01:35:24.940447727Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
2024-06-22T01:35:24.940450087Z     return get_command(self)(*args, **kwargs)
2024-06-22T01:35:24.940452947Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-06-22T01:35:24.940455237Z     return self.main(*args, **kwargs)
2024-06-22T01:35:24.940457377Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
2024-06-22T01:35:24.940459427Z     return _main(
2024-06-22T01:35:24.940461527Z   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
2024-06-22T01:35:24.940463547Z     rv = self.invoke(ctx)
2024-06-22T01:35:24.940465637Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
2024-06-22T01:35:24.940467637Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-06-22T01:35:24.940469767Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-06-22T01:35:24.940471797Z     return ctx.invoke(self.callback, **ctx.params)
2024-06-22T01:35:24.940473867Z   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-06-22T01:35:24.940475907Z     return __callback(*args, **kwargs)
2024-06-22T01:35:24.940477937Z   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
2024-06-22T01:35:24.940479937Z     return callback(**use_params)  # type: ignore
2024-06-22T01:35:24.940481977Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
2024-06-22T01:35:24.940483977Z     server.serve(
2024-06-22T01:35:24.940486097Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
2024-06-22T01:35:24.940488187Z     asyncio.run(
2024-06-22T01:35:24.940490297Z   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
2024-06-22T01:35:24.940492517Z     return loop.run_until_complete(main)
2024-06-22T01:35:24.940494587Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-06-22T01:35:24.940496737Z     self.run_forever()
2024-06-22T01:35:24.940498877Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-06-22T01:35:24.940500997Z     self._run_once()
2024-06-22T01:35:24.940503147Z   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-06-22T01:35:24.940505417Z     handle._run()
2024-06-22T01:35:24.940507627Z   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
2024-06-22T01:35:24.940509857Z     self._context.run(self._callback, *self._args)
2024-06-22T01:35:24.940518256Z   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2024-06-22T01:35:24.940521216Z     return await self.intercept(
2024-06-22T01:35:24.940523476Z > File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
2024-06-22T01:35:24.940525606Z     return await response
2024-06-22T01:35:24.940527986Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-06-22T01:35:24.940530426Z     raise error
2024-06-22T01:35:24.940532486Z   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-06-22T01:35:24.940534576Z     return await behavior(request_or_iterator, context)
2024-06-22T01:35:24.940538416Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 88, in Prefill
2024-06-22T01:35:24.940540536Z     batch = self.model.batch_type.from_pb(
2024-06-22T01:35:24.940542666Z   File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 272, in from_pb
2024-06-22T01:35:24.940544706Z     adapter_indices = torch.cat(adapter_indices_list).to(dtype=torch.int64, device=device)
2024-06-22T01:35:24.940550316Z RuntimeError: CUDA error: device-side assert triggered
2024-06-22T01:35:24.940552636Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-22T01:35:24.940554636Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-22T01:35:24.940556746Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
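
The trace itself notes that CUDA errors can be reported asynchronously, so the `torch.cat(...)` line may not be the real failure site. A possible next step is to relaunch with `CUDA_LAUNCH_BLOCKING=1`, as the message suggests. The sketch below reuses the launcher flags from this report (without `--env`) and only adds that variable. The helper names `debug_env` and `relaunch_for_debugging` are made up, and it is an assumption that the launcher passes its environment on to the Python shard process.

```python
import os
import subprocess

# Launcher flags copied from the command in this report (without --env).
LAUNCH_COMMAND = [
    "lorax-launcher",
    "--model-id", "microsoft/phi-2",
    "--adapter-source", "s3",
    "--compile",
    "--dtype", "bfloat16",
    "--port", "3000",
    "--revision", "ef382358ec9e382308935a992d908de099b64c23",
    "--max-input-length", "2000",
    "--max-total-tokens", "2048",
]


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Makes kernel launches synchronous so the Python traceback points at
    # the call that actually triggered the device-side assert.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def relaunch_for_debugging():
    """Start the launcher with the debugging environment (not called here)."""
    return subprocess.Popen(LAUNCH_COMMAND, env=debug_env())
```

When running in Docker, the same variable can be passed with `docker run -e CUDA_LAUNCH_BLOCKING=1 ...` instead. Synchronous launches slow generation down, so this is only meant for capturing a more accurate stack trace.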

predibase / lorax

Fails hard on large requests #523
