openshift-psap / topsail

Test Orchestrator for Performance and Scalability of AI pLatforms
Apache License 2.0

[kserve] Update to RHOAI 2.15 #576

Open mcharanrm opened 3 weeks ago

mcharanrm commented 3 weeks ago

Updated tag and version fields in kserve/config.yaml file for OpenShift AI 2.15.0 RC1 model serving performance validation.

I will use the existing CI presets "cpt_single_model_gating" and "vllm_cpt_single_model_gating" to deploy LLMs with the "TGIS standalone ServingRuntime" and the "vLLM ServingRuntime" when launching e2e CPT tests through topsail from the middleware CI.

openshift-ci[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign ccamacho for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/openshift-psap/topsail/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

topsail-bot[bot] commented 3 weeks ago

Jenkins Job #1585

:red_circle: Test of 'rhoai test test_ci' failed after 06 hours 58 minutes 49 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run kserve test test_ci
PR_POSITIONAL_ARGS: cpt_single_model_gating
PR_POSITIONAL_ARG_0: kserve-perf-ci
PR_POSITIONAL_ARG_1: cpt_single_model_gating

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1585/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//002_test_ci/FAILURES/view/):

/logs/artifacts/002_test_ci/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected)
RuntimeError: An error happened during the visualization post-processing ... (regression detected)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 464, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]

[Test ran on the internal Perflab CI]

topsail-bot[bot] commented 3 weeks ago

Jenkins Job #1586

:red_circle: Test of 'rhoai test test_ci' failed after 07 hours 34 minutes 32 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run kserve test test_ci
PR_POSITIONAL_ARGS: vllm_cpt_single_model_gating
PR_POSITIONAL_ARG_0: kserve-perf-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1586/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//002_test_ci/FAILURES/view/):

/logs/artifacts/002_test_ci/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected)
RuntimeError: An error happened during the visualization post-processing ... (regression detected)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 464, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]

[Test ran on the internal Perflab CI]

mcharanrm commented 3 weeks ago

Both tests completed successfully, but their status is marked as failed because a few KPIs didn't pass the regression analysis.

Topsail performs regression analysis on the llm-load-test KPIs as well as on the resource-utilization KPIs. In the regression-analysis reports, the number of KPIs that didn't pass is very small: (3/507), (1/572) and (5/60), which is insignificant. The KPIs that didn't pass come neither from the same LLM model nor from the same KPI across different models.
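To put those ratios in perspective, here is a minimal sketch that converts the three counts quoted above into percentages. The report labels (`report_1` and so on) are placeholders I made up for illustration, since the counts are not tied to named reports in this comment, and this is not how Topsail itself computes or reports them.

```python
# (failed, total) KPI counts as quoted in the comment above.
# The dictionary keys are hypothetical labels, not Topsail report names.
reports = {
    "report_1": (3, 507),
    "report_2": (1, 572),
    "report_3": (5, 60),
}


def failure_rate(failed: int, total: int) -> float:
    """Return the percentage of KPIs that did not pass."""
    return 100.0 * failed / total


# Percentage of non-passing KPIs for each report.
rates = {name: failure_rate(failed, total) for name, (failed, total) in reports.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of KPIs did not pass")
```

The first two ratios come out well below 1%. The third is larger in relative terms because its total is only 60 KPIs, although it still covers just five KPIs.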

The values recorded for the KPIs that didn't pass the regression analysis look like potential outliers to me. We can safely acknowledge them and consider that this is not a blocker for 2.15.0 RC1 model-serving performance, since no trace of a potential regression was identified.

topsail-bot[bot] commented 2 weeks ago

Jenkins Job #1589

:red_circle: Test of 'rhoai test test_ci' failed after 08 hours 04 minutes 25 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run kserve test test_ci
PR_POSITIONAL_ARGS: vllm_cpt_single_model_gating
PR_POSITIONAL_ARG_0: kserve-perf-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1589/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//002_test_ci/FAILURES/view/):

/logs/artifacts/002_test_ci/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected)
RuntimeError: An error happened during the visualization post-processing ... (regression detected)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 464, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]

[Test ran on the internal Perflab CI]

topsail-bot[bot] commented 2 weeks ago

Jenkins Job #1591

:red_circle: Test of 'rhoai test test_ci' failed after 06 hours 55 minutes 52 seconds. :red_circle:

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run kserve test test_ci
PR_POSITIONAL_ARGS: cpt_single_model_gating
PR_POSITIONAL_ARG_0: kserve-perf-ci
PR_POSITIONAL_ARG_1: cpt_single_model_gating

• Link to the Rebuild page.

[Failure indicator](https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/RHODS/job/topsail/1591/artifact/run/f23-h33-000-6018r.rdu2.scalelab.redhat.com//002_test_ci/FAILURES/view/):

/logs/artifacts/002_test_ci/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected)
RuntimeError: An error happened during the visualization post-processing ... (regression detected)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 464, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]

[Test ran on the internal Perflab CI]