microsoft / vidur

A large-scale simulation framework for LLM inference
MIT License

Kaleido subprocess Segmentation fault #20

Closed rajeshitshoulders closed 5 days ago

rajeshitshoulders commented 2 weeks ago

Hi, I'm getting the error below when trying to start the vidur simulator on Ubuntu 20.04 in a Python 3.10 venv. I also tested with mamba.

INFO 07-09 16:17:21 config.py:21] trace_request_length_generator_decode_scale_factor: 1
INFO 07-09 16:17:21 config.py:21] trace_request_length_generator_prefill_scale_factor: 1
INFO 07-09 16:17:21 config.py:21] trace_request_length_generator_trace_file: ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv
INFO 07-09 16:17:21 config.py:21] vllm_scheduler_max_tokens_in_batch: 4096
INFO 07-09 16:17:21 config.py:21] vllm_scheduler_watermark_blocks_fraction: 0.01
INFO 07-09 16:17:21 config.py:21] write_chrome_trace: true
INFO 07-09 16:17:21 config.py:21] write_json_trace: false
INFO 07-09 16:17:21 config.py:21] write_metrics: true
INFO 07-09 16:17:21 config.py:21] zipf_request_length_generator_scramble: false
INFO 07-09 16:17:21 config.py:21] zipf_request_length_generator_theta: 0.4
INFO 07-09 16:17:21 config.py:21]
INFO 07-09 16:17:21 trace_request_length_generator.py:81] Loaded request length trace file ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv with 28257 requests
INFO 07-09 16:17:22 simulator.py:56] Starting simulation with cluster: Cluster({'id': 0, 'num_replicas': 1}) and 127 requests
INFO 07-09 16:17:24 simulator.py:76] Simulation ended at: 51.67980373407166s
INFO 07-09 16:17:24 simulator.py:79] Writing output
Exception ignored in atexit callback: <bound method Simulator._write_output of <vidur.simulator.Simulator object at 0xfffd0f347ac0>>
Traceback (most recent call last):
  File "/home/nvidia/vidur/vidur/simulator.py", line 81, in _write_output
    self._metric_store.plot()
  File "/home/nvidia/vidur/vidur/metrics/metrics_store.py", line 34, in wrapper
    return func(self, *args, **kwargs)
  File "/home/nvidia/vidur/vidur/metrics/metrics_store.py", line 499, in plot
    self._store_request_metrics(dir_plot_path)
  File "/home/nvidia/vidur/vidur/metrics/metrics_store.py", line 403, in _store_request_metrics
    dataseries.plot_histogram(base_plot_path, dataseries._y_name)
  File "/home/nvidia/vidur/vidur/metrics/data_series.py", line 295, in plot_histogram
    fig.write_image(f"{path}/{plot_name}.png")
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3841, in write_image
    return pio.write_image(self, *args, **kwargs)
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 266, in write_image
    img_data = to_image(
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 143, in to_image
    img_bytes = scope.transform(
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/kaleido/scopes/plotly.py", line 153, in transform
    response = self._perform_transform(
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/kaleido/scopes/base.py", line 293, in _perform_transform
    self._ensure_kaleido()
  File "/home/nvidia/.vidru/lib/python3.10/site-packages/kaleido/scopes/base.py", line 198, in _ensure_kaleido
    raise ValueError(message)
ValueError: Failed to start Kaleido subprocess. Error stream:

/home/nvidia/.vidru/lib/python3.10/site-packages/kaleido/executable/kaleido: line 11: 257246 Segmentation fault /home/nvidia/.vidru/lib/python3.10/site-packages/kaleido/executable/bin/kaleido $@

I tried different versions of Kaleido and Plotly, but still no luck.

Any help would be greatly appreciated.

AgrawalAmey commented 2 weeks ago

Hi,

Can you share more details about your setup, in particular the OS and kaleido versions? For now, as a workaround, you can disable plot output by setting metric_store_store_plots to false.
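If turning plots off entirely is not desirable, the same idea can be applied at the call site. This is a rough sketch only: wrap the static-image export so that a Kaleido failure does not abort the run. `fig.write_image` and `fig.write_html` are existing Plotly figure methods, and the traceback above shows the failure surfacing as a `ValueError`. The helper name `save_figure` and the HTML fallback are assumptions for illustration and are not part of vidur.

```python
from pathlib import Path


def save_figure(fig, path_stem):
    """Try a PNG export; fall back to HTML if the image backend fails.

    `fig` is expected to behave like a plotly Figure, i.e. to provide
    write_image() and write_html(). `save_figure` is a hypothetical
    helper, not part of vidur or plotly.
    Returns the path that was actually written.
    """
    png_path = Path(f"{path_stem}.png")
    try:
        fig.write_image(str(png_path))
        return png_path
    except ValueError as err:
        # In the traceback above, the Kaleido start-up failure is raised
        # as ValueError, so that is the only exception handled here.
        html_path = Path(f"{path_stem}.html")
        fig.write_html(str(html_path))
        print(f"PNG export failed ({err}); wrote {html_path} instead")
        return html_path
```

HTML export does not go through Kaleido, so the plots remain viewable in a browser even when the PNG export crashes.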

rajeshitshoulders commented 2 weeks ago

Hi Agrawal,

OS: Ubuntu 22.04, and also Ubuntu 20.04. kaleido = 0.2.1 (also tried 0.1.0 and 0.2.0), same error.

I will try the suggested workaround.
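For reference, a small sketch that gathers these details in one go, using only the Python standard library. The function name `environment_report` is made up for this example. It also prints the CPU architecture, which may matter for a prebuilt binary such as the Kaleido executable.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("kaleido", "plotly")):
    """Collect OS, CPU architecture, Python and package versions."""
    report = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it inside the same venv that runs the simulator ensures the reported versions are the ones actually in use.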