iter8-tools / iter8

Kubernetes release optimizer
https://iter8.tools
Apache License 2.0

KServe performance tutorial failed #1410

Open kalantar opened 1 year ago

kalantar commented 1 year ago

Describe the bug: Running the tutorial failed. The report had no output. The job logs show:

% kubectl logs job.batch/default-1-job 
time=2023-02-27 18:25:23 level=info msg=task 1: run: started
time=2023-02-27 18:25:23 level=info msg=task 1: run: completed
time=2023-02-27 18:25:23 level=info msg=task 2: http: started
time=2023-02-27 18:25:23 level=error msg=fortio failed stack-trace=below ... 
::Trace:: lookup sklearn-irisv2.default.svc.cluster.local on 10.96.0.10:53: no such host
time=2023-02-27 18:25:23 level=error msg=failed to get results since fortio run was aborted
time=2023-02-27 18:25:23 level=error msg=task 2: http: failure
Error: lookup sklearn-irisv2.default.svc.cluster.local on 10.96.0.10:53: no such host

The referenced service seems to exist.

To Reproduce: Run the tutorial.

Additional context: On retry, the inference service took more than 5 minutes to become ready. It may be that the readiness check did not cause the experiment to fail when it should have.
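
One way to confirm that the "no such host" error is only a timing problem is to poll DNS for the service name until it resolves. The following is a minimal sketch, not part of Iter8 or Fortio; `wait_for_dns` and its parameters are hypothetical names, and the 10-minute default is an assumption based on the observation that the service took more than 5 minutes to become ready.

```python
import socket
import time


def wait_for_dns(host, timeout_s=600.0, interval_s=5.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Return True once `host` resolves, False if `timeout_s` elapses first.

    `resolve`, `sleep` and `clock` are injectable so the loop can be
    exercised without a cluster (hypothetical design choice).
    """
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(host, 80)
            return True
        except socket.gaierror:
            # Name resolution failed; roughly the Python analogue of the
            # "no such host" lookup error in the job logs.
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

Run from a pod inside the cluster, `wait_for_dns("sklearn-irisv2.default.svc.cluster.local")` would return `True` only once the service name is resolvable.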

kalantar commented 1 year ago

It appears that the basic readiness check works. I set the timeout to 10s and increased the logging level. I repeatedly see:

time=2023-02-27 21:35:44 level=trace msg=looking for resource (serving.kserve.io/v1beta1) inferenceservices: sklearn-irisv2 in namespace default
time=2023-02-27 21:35:44 level=trace msg=looking for condition: Ready
time=2023-02-27 21:35:44 level=error msg=condition status not True

followed by

time=2023-02-27 21:35:44 level=error msg=task 1: ready: failure
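
The behavior in these log lines (look up the resource, look for the `Ready` condition, fail if its status is not `True` before the timeout) can be sketched as below. This is an illustration of the logic only, not Iter8's actual implementation; `condition_is_true`, `wait_for_condition` and the `fetch` callable are hypothetical, and the resource is assumed to be the usual Kubernetes object shape with `status.conditions`.

```python
import time


def condition_is_true(resource, condition_type="Ready"):
    """Return True if status.conditions has `condition_type` with status "True"."""
    conditions = (resource.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == condition_type:
            # Kubernetes condition statuses are the strings
            # "True", "False" or "Unknown".
            return cond.get("status") == "True"
    return False


def wait_for_condition(fetch, condition_type="Ready", timeout_s=10.0,
                       interval_s=1.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll `fetch()` until the condition is True or the timeout expires.

    `fetch` is a hypothetical callable returning the resource as a dict.
    Returns False on timeout, which the caller should treat as task failure.
    """
    deadline = clock() + timeout_s
    while True:
        if condition_is_true(fetch(), condition_type):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a 10s timeout and an inference service that needs several minutes, a loop like this returns `False`, which matches the `task 1: ready: failure` seen above.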

iter8 k report correctly identifies a failed task/experiment:

Experiment summary:
*******************

  Experiment completed: false
  No task failures: false
  Total number of tasks: 4
  Number of completed tasks: 0
  Number of completed loops: 1

kalantar commented 1 year ago

Copy of slack comment:

I wonder if Fortio's

  -allow-initial-errors
        Allow and don't abort on initial warmup errors

should be exposed as a parameter in the http task. This might be a simple "fix" worth trying ... of course, more "warmup" behavior can also be defined in the task if this is insufficient.
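
To make the suggestion concrete, here is a rough sketch of the semantics the flag's help text describes: a failed warmup request either aborts the run or is tolerated. This is not Fortio's code and not an existing Iter8 parameter; `run_with_warmup`, `send` and `allow_initial_errors` are hypothetical names used only for illustration.

```python
def run_with_warmup(send, n_requests, allow_initial_errors=False):
    """Send one warmup request, then `n_requests` measured requests.

    `send` is a hypothetical callable that raises OSError on failure.
    Returns a list of booleans, one per measured request.
    """
    try:
        send()  # warmup request
    except OSError as err:
        if not allow_initial_errors:
            # Default behavior: abort the whole run on a warmup error.
            raise RuntimeError(f"aborted on warmup error: {err}") from err
        # Otherwise the warmup error is ignored and the run continues.

    results = []
    for _ in range(n_requests):
        try:
            send()
            results.append(True)
        except OSError:
            results.append(False)
    return results
```

Note that tolerating the warmup error only helps if the service becomes reachable shortly afterwards; if it takes minutes, the measured requests would still fail, which is why additional warmup behavior in the task may be needed.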