knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Creating a new revision of a model returns HTTP code 0 #13230

Closed gawsoftpl closed 2 years ago

gawsoftpl commented 2 years ago

/kind bug

When I create a new revision for a model under high load (200 requests per second), I receive a lot of HTTP code 0 responses:

vegeta attack -duration=60s -timeout=5s -rate=200 --targets=targets.txt | vegeta report --type=text
^C
Requests      [total, rate, throughput]         10619, 200.02, 28.99
Duration      [total, attack, wait]             58.09s, 53.09s, 5s
Latencies     [min, mean, 50, 90, 95, 99, max]  434.187ms, 4.617s, 5s, 5s, 5s, 5.001s, 5.009s
Bytes In      [total, mean]                     5427245, 511.09
Bytes Out     [total, mean]                     185349460, 17454.51
Success       [ratio]                           15.86%
Status Codes  [code:count]                      0:8935  200:1684  
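
(Note: status code 0 in a vegeta report is not an HTTP status; it means no response was received at all, and the latencies pegged at the 5s timeout above point to client-side timeouts. For context, vegeta reads its attack targets from the file passed via --targets. A minimal targets.txt for a setup like this might look as follows; the predict path and payload file are hypothetical, only the host comes from the logs below:)

# one target per line: METHOD URL, optional headers, @file for the request body
POST http://ml-cookies-ensemble-predictor-default.default.svc.cluster.local/v1/models/ensemble:predict
Content-Type: application/json
@payload.json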

After installing the new revisions, only 50% of the pods received inference traffic; the remaining 50% received none.

I don't know whether this is an error in Knative or in Istio.

When I delete all models, wait for all pods to be deleted, and then create the models again from scratch, everything works fine.

Model architecture

The error occurs when I create a new revision of model 1 or model 2: the transformer routes the gRPC requests to only a subset of the new model version's pods.

Error in Istio:

2022-08-18T18:28:50.243968Z    info    ads    Push Status: {
    "pilot_vservice_dup_domain": {
        "ml-cookies-ensemble-predictor-default.default.svc.cluster.local:80": {
            "proxy": "ml-cookies-tabular-features-predictor-default-00003-deployg2h8j.default",
            "message": "duplicate domain from service: ml-cookies-ensemble-predictor-default.default.svc.cluster.local:80"
        },
        "ml-cookies-nn-predictor-default.default.svc.cluster.local:80": {
            "proxy": "ml-cookies-tabular-features-predictor-default-00003-deployg2h8j.default",
            "message": "duplicate domain from service: ml-cookies-nn-predictor-default.default.svc.cluster.local:80"
        },
        "ml-cookies-tabular-features-predictor-default.default.svc.cluster.local:80": {
            "proxy": "ml-cookies-tabular-features-predictor-default-00003-deployg2h8j.default",
            "message": "duplicate domain from service: ml-cookies-tabular-features-predictor-default.default.svc.cluster.local:80"
        }
    }
}
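
(The pilot_vservice_dup_domain entries mean istiod saw the same host claimed by more than one VirtualService, which can happen transiently while Knative rewrites Routes during a revision rollout. A few read-only checks, assuming istioctl is installed and the gateway pod name is substituted, show what config the proxies actually hold:)

# list proxies and whether they are in sync with istiod
istioctl proxy-status

# inspect the routes a given gateway or sidecar pod currently holds
istioctl proxy-config routes <istio-ingressgateway-pod> -n istio-system

# list the VirtualServices that claim the duplicated hosts
kubectl get virtualservices -A | grep ml-cookies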

Environment:

nader-ziada commented 2 years ago

do you see any errors in the kserve logs? it's hard to tell which layer could have the issue

gawsoftpl commented 2 years ago

Error resolved. The problem was that, under high load, I rolled out a new revision and shifted 100% of traffic to it immediately; during that process the server hit a bottleneck. When I changed the deployment to a canary rollout with 10% steps, everything worked fine.
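
(For readers hitting the same thing: a stepped rollout can be expressed directly on the Knative Service via its traffic block, keeping most traffic pinned to the previous revision while the new one warms up. A minimal sketch with hypothetical names; KServe exposes a similar knob via canaryTrafficPercent on the InferenceService:)

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ml-cookies-nn-predictor              # hypothetical service name
spec:
  template:
    metadata:
      name: ml-cookies-nn-predictor-00004    # pin a name for the new revision
    spec:
      containers:
        - image: example.com/nn-predictor:v4 # hypothetical image
  traffic:
    - revisionName: ml-cookies-nn-predictor-00004
      percent: 10                            # canary step: 10% to the new revision
    - revisionName: ml-cookies-nn-predictor-00003
      percent: 90                            # rest stays on the known-good revision

Once the canary looks healthy, the percentages are bumped stepwise (e.g. 10 → 25 → 50 → 100) instead of cutting all traffic over at once.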

nader-ziada commented 2 years ago

thanks for the update, will close the issue for now and please feel free to reopen if you see the issue again

/close

knative-prow[bot] commented 2 years ago

@nader-ziada: Closing this issue.

In response to [this](https://github.com/knative/serving/issues/13230#issuecomment-1222373253):

> thanks for the update, will close the issue for now and please feel free to reopen if you see the issue again
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.