Crashing with large number of concurrent users #17

Status: Open. francescov1 opened this issue 7 months ago

francescov1 commented 7 months ago

This tool often crashes when I run tests at concurrency levels above 500.

Here's the command I'm running:

python llmperf.py -f openai -r 2000 -c 2000 -m "mistralai/Mistral-7B-v0.1"

Here's the error log I'm seeing:

(pid=91343) [2023-11-30 19:53:26,142 E 91343 91820] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(pid=98374) E1130 19:53:52.486123357   99023 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1)
2023-11-30 19:53:53,141 WARNING worker.py:2074 -- The node with node id: 19a548a9a5a2ece8fe9589ae0846a32901d76bf544f5e286201bb770 and address: and node name: has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "/home/francescovirga/llmperf/llmperf.py", line 480, in <module>
    query_results = endpoint_evaluation(endpoint_config, sample_lines)
  File "/home/francescovirga/llmperf/llmperf.py", line 270, in endpoint_evaluation
    results = ray.get(futures)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(pid=98046) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. [repeated 74x across cluster]
(validate pid=68909) [2023-11-30 19:53:53,594 E 68909 69441] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 205x across cluster]
(pid=85196) E1130 19:53:51.788929082   85763 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1) [repeated 666x across cluster]
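
The log shows the local raylet and GCS becoming unreachable during a run with `-c 2000`. That is consistent with the machine running out of resources, although the log alone does not confirm it. One way to test that hypothesis is to cap how many requests are pending at once and raise the cap in steps until the failure reappears. The helper below is a minimal sketch, not part of llmperf: `run_bounded`, `fn`, and `max_in_flight` are hypothetical names, and it uses the standard library thread pool instead of Ray so that it runs without a cluster. In Ray, the same shape is typically built around `ray.wait`.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(fn, items, max_in_flight):
    """Call fn on every item, keeping at most max_in_flight calls pending.

    Results are returned in completion order, not input order.
    """
    results = []
    pending = set()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for item in items:
            # Once the cap is reached, block until at least one call finishes
            # before submitting the next one.
            if len(pending) >= max_in_flight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(fn, item))
        # Drain whatever is still running after the last submission.
        done, _ = wait(pending)
        results.extend(f.result() for f in done)
    return results


if __name__ == "__main__":
    # Stand-in workload; a real test would issue one request per item.
    out = run_bounded(lambda x: x * 2, range(10), max_in_flight=3)
    print(sorted(out))
```

Running the same workload at, say, 250, 500, and 1000 pending calls would show whether the crash tracks the number of simultaneous workers or something else.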