opea-project / GenAIInfra

Containerization and cloud native suite for OPEA
Apache License 2.0

GMC: apply deployment failure sometimes #108

Closed KfreeZ closed 1 week ago

KfreeZ commented 1 week ago

The logs below show that the GMC controller tried 3 times to provision the tgi-service-deployment Deployment before it succeeded. The error might be linked to the Deployment's ManagedFields.

reconcile resource for node: Tgi
trying to reconcile internal service [ tgi-service ] in namespace  chatqna-20240619083553-codegen
get step Tgi config for tgi-service@chatqna-20240619083553-codegen: &map[LLM_MODEL_ID:ise-uiuc/Magicoder-S-DS-6.7B endpoint:/generate]
The raw yaml file has been split into 3 yaml files
Success to reconcile Deployment: tgi-service-deployment
2024-06-19T08:39:04Z    INFO    Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler  {"controller": "gmconnector", "controllerGroup": "gmc.opea.io", "controllerKind": "GMConnector", "GMConnector": {"name":"codegen","namespace":"chatqna-20240619083553-codegen"}, "namespace": "chatqna-20240619083553-codegen", "name": "codegen", "reconcileID": "cec955c5-baa7-4cb3-b917-dd31c84a763b"}
2024-06-19T08:39:04Z    ERROR   Reconciler error    {"controller": "gmconnector", "controllerGroup": "gmc.opea.io", "controllerKind": "GMConnector", "GMConnector": {"name":"codegen","namespace":"chatqna-20240619083553-codegen"}, "namespace": "chatqna-20240619083553-codegen", "name": "codegen", "reconcileID": "cec955c5-baa7-4cb3-b917-dd31c84a763b", "error": "Failed to reconcile service for tgi-service: Failed to update deployment: Operation cannot be fulfilled on deployments.apps \"tgi-service-deployment\": the object has been modified; please apply your changes to the latest version and try again\n", "errorVerbose": "Failed to update deployment: Operation cannot be fulfilled on deployments.apps \"tgi-service-deployment\": the object has been modified; please apply your changes to the latest version and try again\n\nFailed to reconcile service for tgi-service\ngithub.com/opea-project/GenAIInfra/microservices-connector/internal/controller.(*GMConnectorReconciler).Reconcile\n\t/workspace/internal/controller/gmconnector_
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
Reconciling connector graph apiVersion gmc.opea.io/v1alpha3 graph codegen
adjust config: /tmp/microservices/yamls/qna_configmap_xeon.yaml
Success to apply the adjusted configmap
reconcile resource for node: Llm
trying to reconcile internal service [ llm-service ] in namespace  chatqna-20240619083553-codegen
get step Llm config for llm-service@chatqna-20240619083553-codegen: &map[endpoint:/v1/chat/completions]
The raw yaml file has been split into 3 yaml files
Success to reconcile Deployment: llm-service-deployment
Success to reconcile Service: llm-service
the service URL is: http://llm-service.chatqna-20240619083553-codegen.svc.cluster.local:9000/v1/chat/completions
reconcile resource for node: Tgi
trying to reconcile internal service [ tgi-service ] in namespace  chatqna-20240619083553-codegen
get step Tgi config for tgi-service@chatqna-20240619083553-codegen: &map[LLM_MODEL_ID:ise-uiuc/Magicoder-S-DS-6.7B endpoint:/generate]
The raw yaml file has been split into 3 yaml files
Success to reconcile Deployment: tgi-service-deployment
2024-06-19T08:39:08Z    INFO    Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler  {"controller": "gmconnector", "controllerGroup": "gmc.opea.io", "controllerKind": "GMConnector", "GMConnector": {"name":"codegen","namespace":"chatqna-20240619083553-codegen"}, "namespace": "chatqna-20240619083553-codegen", "name": "codegen", "reconcileID": "34a6aebd-b88d-4d5d-a8e2-509f95e5fd97"}
2024-06-19T08:39:08Z    ERROR   Reconciler error    {"controller": "gmconnector", "controllerGroup": "gmc.opea.io", "controllerKind": "GMConnector", "GMConnector": {"name":"codegen","namespace":"chatqna-20240619083553-codegen"}, "namespace": "chatqna-20240619083553-codegen", "name": "codegen", "reconcileID": "34a6aebd-b88d-4d5d-a8e2-509f95e5fd97", "error": "Failed to reconcile service for tgi-service: Failed to update deployment: Operation cannot be fulfilled on deployments.apps \"tgi-service-deployment\": the object has been modified; please apply your changes to the latest version and try again\n", "errorVerbose": "Failed to update deployment: Operation cannot be fulfilled on deployments.apps \"tgi-service-deployment\": the object has been modified; please apply your changes to the latest version and try again\n\nFailed to reconcile service for tgi-service\ngithub.com/opea-project/GenAIInfra/microservices-connector/internal/controller.(*GMConnectorReconciler).Reconcile\n\t/workspace/internal/controller/gmconnector_
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
Reconciling connector graph apiVersion gmc.opea.io/v1alpha3 graph codegen
adjust config: /tmp/microservices/yamls/qna_configmap_xeon.yaml
Success to apply the adjusted configmap
reconcile resource for node: Llm
trying to reconcile internal service [ llm-service ] in namespace  chatqna-20240619083553-codegen
get step Llm config for llm-service@chatqna-20240619083553-codegen: &map[endpoint:/v1/chat/completions]
The raw yaml file has been split into 3 yaml files
Success to reconcile Deployment: llm-service-deployment
Success to reconcile Service: llm-service
the service URL is: http://llm-service.chatqna-20240619083553-codegen.svc.cluster.local:9000/v1/chat/completions
reconcile resource for node: Tgi
trying to reconcile internal service [ tgi-service ] in namespace  chatqna-20240619083553-codegen
get step Tgi config for tgi-service@chatqna-20240619083553-codegen: &map[LLM_MODEL_ID:ise-uiuc/Magicoder-S-DS-6.7B endpoint:/generate]
The raw yaml file has been split into 3 yaml files
Success to reconcile Deployment: tgi-service-deployment
Success to reconcile Service: tgi-service

KfreeZ commented 1 week ago

Refer to #107, which adds an update to the latest version when applying k8s resources, to solve this problem.