kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0

skr-aws-svcat-migration-dev job fails all the time #12843

Closed pbochynski closed 2 years ago

pbochynski commented 2 years ago

Description

The test fails on smctl login:

  1) SKR SVCAT migration test
       Should provision new ServiceManager platform:
     Error: failed "smctl login -a https://service-manager.cfapps.sap.hana.ondemand.com --param subdomain=e2etestingscmigration --auth-flow client-credentials": Error: could not login
Reason: auth error: Client Authentication failed.
      at Object.provisionPlatform (skr-svcat-migration-test/test-helpers.js:202:15)
      at processTicksAndRejections (internal/process/task_queues.js:95:5)
      at async Context.<anonymous> (skr-svcat-migration-test/skr-svcat-migration-test.js:44:21)

Job history: https://status.build.kyma-project.io/job-history/gs/kyma-prow-logs/logs/skr-aws-svcat-migration-dev
Sample failed test log: https://storage.googleapis.com/kyma-prow-logs/logs/skr-aws-svcat-migration-dev/1470302568000262144/build-log.txt

wozniakjan commented 2 years ago

@PK85 asked me to try locally and indeed, smctl fails to log in with the credentials:

$ smctl login -a "$(jq --raw-output '.sm_url' creds.json)" --param subdomain=e2etestingscmigration --auth-flow client-credentials --client-id "$(jq --raw-output '.clientid' creds.json)" --client-secret "$(jq --raw-output '.clientsecret' creds.json)" -v
DEBU[0000] Sending request GET https://service-manager.cfapps.sap.hana.ondemand.com/v1/info?subdomain=e2etestingscmigration  component="smclient/client.go:509" correlation_id=-
DEBU[0000] authenticator: oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"unauthorized","error_description":"Client Authentication failed."}  component="oidc/oidc.go:74" correlation_id=-
DEBU[0000] oidc error: oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"unauthorized","error_description":"Client Authentication failed."}  component="oidc/oidc.go:108" correlation_id=-
Error: could not login
Reason: auth error: Client Authentication failed.

maybe something in SM has changed and they stopped supporting this auth workflow?
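
For anyone reproducing this from a Node.js script instead of a shell, here is a minimal sketch of how the same arguments could be assembled. The `creds.json` field names (`sm_url`, `clientid`, `clientsecret`) and the smctl flags are taken from the command above; `smctlLoginArgs` is a hypothetical helper name, not something that exists in the test suite.

```javascript
// Hypothetical helper: builds the argument list for the `smctl login` call shown above.
// The field names (sm_url, clientid, clientsecret) mirror the jq lookups on creds.json.
function smctlLoginArgs(creds, subdomain) {
  for (const key of ['sm_url', 'clientid', 'clientsecret']) {
    if (!creds[key]) {
      // Fail early with a clear message instead of a 401 from the server.
      throw new Error(`creds is missing "${key}"`);
    }
  }
  return [
    'login',
    '-a', creds.sm_url,
    '--param', `subdomain=${subdomain}`,
    '--auth-flow', 'client-credentials',
    '--client-id', creds.clientid,
    '--client-secret', creds.clientsecret,
    '-v',
  ];
}
```

Passing the array to `child_process.execFileSync('smctl', args)` rather than building one command string avoids shell quoting problems with secrets that contain special characters.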

wozniakjan commented 2 years ago

hmm, looks like the bindings for the SM instance got removed, so I guess we will need a new binding (and new creds) then

wozniakjan commented 2 years ago

getting 401 even with new creds, contacted @michal-keidar for ideas to address this

wozniakjan commented 2 years ago

I recommend that you open an NGPBUG ticket to the dev team (component “Service Manager”). They have made some changes recently due to Crown Jewels so they will be able to investigate and assist.

here is the ticket https://jtrack.wdf.sap.corp/browse/NGPBUG-172313

PK85 commented 2 years ago

Status: ticket created for the SM team, still no one assigned on their side. Bumped the NGPBUG ticket priority to critical.

szwedm commented 2 years ago

Status: we resolved the issue with SM Instance binding. Now the test pipeline is failing because of random behavior of sm-proxy. In the test we are trying to provision 3 Service Instances from SM, but not all Service Brokers are present in the cluster. Depending on the test run, we see 1 or 2 random Service Brokers from SM, but not all required 3 and that's what we are trying to resolve now.

pbochynski commented 2 years ago

Right now this is a blocker for that pipeline: https://github.com/kyma-project/kyma/issues/12945

ksputo commented 2 years ago

One of the root issues was that service-manager-proxy, installed concurrently by the reconciler, was trying to create ClusterServiceBrokers when the service-catalog components were not ready yet:

time="2021-12-30T09:45:54Z" level=error msg="Internal error occurred: failed calling webhook \"mutating.clusterservicebrokers.servicecatalog.k8s.io\": Post \"https://service-catalog-catalog-webhook.kyma-system.svc:443/mutating-clusterservicebrokers?timeout=30s\": no endpoints available for service \"service-catalog-catalog-webhook\"" component="reconcile/task_scheduler.go:43" correlation_id=0293c363-71ef-4222-a4ef-5faf97ad4bd3

This one was resolved in: https://github.com/kyma-project/kyma/pull/12959
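
A rough sketch of the general idea of tolerating this race on the caller side: treat the "no endpoints available" webhook error as retryable and try again after a delay. This is only an illustration and not necessarily what the linked PR does; both function names are hypothetical, and the matched substrings are taken from the log line above.

```javascript
// Hypothetical check: does the error look like "admission webhook has no ready endpoints yet"?
function isWebhookNotReadyError(message) {
  return /failed calling webhook/.test(message) &&
    /no endpoints available for service/.test(message);
}

// Retries `action` while it keeps failing with the "webhook not ready" error.
// `sleep` is injectable so the delay can be skipped in tests.
async function retryUntilWebhookReady(action, { attempts = 10, delayMs = 3000, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let i = 1; ; i++) {
    try {
      return await action();
    } catch (err) {
      const message = String(err && err.message);
      // Give up on the last attempt or on any unrelated error.
      if (i >= attempts || !isWebhookNotReadyError(message)) {
        throw err;
      }
      await wait(delayMs);
    }
  }
}
```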

ksputo commented 2 years ago

The last issue for now is the sc-removal job being unable to remove finalizers from UsageKind. It was working fine with Kyma 1.24.8. The job fails with this error:

panic: Operation cannot be fulfilled on usagekinds.servicecatalog.kyma-project.io "serverless-function": the object has been modified; please apply your changes to the latest version and try again

We think this might be caused by the reconciler reverting the state of the CR because of the annotation reconciler.kyma-project.io/managed-by-reconciler-disclaimer:

apiVersion: servicecatalog.kyma-project.io/v1alpha1
kind: UsageKind
metadata:
  annotations:
    reconciler.kyma-project.io/managed-by-reconciler-disclaimer: |-
      DO NOT EDIT - This resource is managed by Kyma.
      Any modifications are discarded and the resource is reverted to the original state.
  creationTimestamp: "2021-12-30T15:28:48Z"
  finalizers:
  - servicecatalog.kyma-project.io/usage-kind-protection
  generation: 1
  labels:
    reconciler.kyma-project.io/managed-by: reconciler
    reconciler.kyma-project.io/origin-version: PR-12959
  name: serverless-function
  resourceVersion: "10816"
  uid: 07516338-e30f-4c3f-b69c-c56ff5909db8
spec:
  displayName: Function
  labelsPath: spec.labels
  resource:
    group: serverless.kyma-project.io
    kind: function
    version: v1alpha1

This seems to be quite random, because manual intervention during the pipeline execution (deleting migration job Pod) sometimes helps and the job manages to remove finalizers from UsageKind.
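
The "object has been modified" message is the API server's optimistic-concurrency conflict: the update carried a stale `resourceVersion`. A common way to handle it is to re-read the object and retry instead of panicking. Below is a minimal sketch of that pattern; `getLatest` and `update` are hypothetical caller-supplied functions (for example thin wrappers around a Kubernetes client), and the assumption that a conflict surfaces as `statusCode === 409` on the error is mine. It only addresses the conflict itself, not a reconciler that keeps reverting the resource.

```javascript
// Sketch of an optimistic-concurrency retry for removing a finalizer.
// `getLatest` and `update` are hypothetical async functions supplied by the caller;
// `update` is assumed to reject with an error carrying statusCode 409 on a conflict.
async function removeFinalizerWithRetry(getLatest, update, finalizer, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Always start from a fresh read so metadata.resourceVersion is current.
    const obj = await getLatest();
    const finalizers = (obj.metadata.finalizers || []).filter((f) => f !== finalizer);
    const patched = { ...obj, metadata: { ...obj.metadata, finalizers } };
    try {
      return await update(patched);
    } catch (err) {
      if (err.statusCode !== 409 || attempt === maxAttempts) {
        throw err;
      }
      // Conflict: another writer changed the object in between; loop and re-read.
    }
  }
}
```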

Possible workaround to verify:

ksputo commented 2 years ago

Resolving sc-removal issues will require additional work with custom reconciler, but it does not block Kyma 2.0 release, only the service management migration (cc: @wozniakjan )

Problems with ClusterServiceBrokers being unavailable on the clusters in tests were mitigated by adding initContainer to service-manager-proxy and will be cherry-picked here:

Regarding those, I am removing release-blocker label from this one.

pbochynski commented 2 years ago

The pipeline is still red - waiting for update operation.

wozniakjan commented 2 years ago

currently, the failure is related to the btp-operator self-signed cert. There was a request to move btp-operator to a different namespace, which was done in https://github.tools.sap/kyma/management-plane-config/pull/1304, but the self-signed cert was not adjusted.

binding 'func-sb-svcat-html5-apps-repo-1' in namespace 'default' failed: 'Internal error occurred: failed calling webhook "mservicebinding.kb.io": Post "https://sap-btp-operator-webhook-service.kyma-system.svc:443/mutate-services-cloud-sap-com-v1alpha1-servicebinding?timeout=10s": x509: certificate is valid for sap-btp-operator-webhook-service.sap-btp-operator.svc, sap-btp-operator-webhook-service.sap-btp-operator.svc.cluster.local, not sap-btp-operator-webhook-service.kyma-system.svc'
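
To make the mismatch concrete: the certificate's DNS names are derived from the service name and namespace, so moving the service to another namespace invalidates them. A small sketch, with hypothetical helper names and `cluster.local` assumed as the cluster domain:

```javascript
// DNS names a webhook serving certificate typically needs for an in-cluster Service.
// "cluster.local" is the common default cluster domain and is an assumption here.
function webhookDnsNames(service, namespace, clusterDomain = 'cluster.local') {
  return [
    `${service}.${namespace}.svc`,
    `${service}.${namespace}.svc.${clusterDomain}`,
  ];
}

// Exact-match check only (no wildcard handling): is the host the API server
// calls listed among the certificate's subject alternative names?
function certCoversHost(sans, host) {
  return sans.includes(host);
}
```

With the names from the error above, a certificate generated for the `sap-btp-operator` namespace does not cover `sap-btp-operator-webhook-service.kyma-system.svc`, while one generated for `kyma-system` does.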

fwiw, the sc-removal was supposed to be deprecated by the sc-migration reconciler, but a constant shift in priorities slowed down that development. I don't think it is worth the effort at the moment to address all the other issues with the sc-removal chart; we should rather finish the sc-migration reconciler task.

ksputo commented 2 years ago

The sc-migration reconciler can be tracked here: https://github.com/kyma-incubator/reconciler/pull/389

wozniakjan commented 2 years ago

moving forward with two more findings regarding the pipeline:

1) one already reported by https://github.com/kyma-project/kyma/issues/12843#issuecomment-1003089390

panic: Operation cannot be fulfilled on usagekinds.servicecatalog.kyma-project.io "serverless-function": the object has been modified; please apply your changes to the latest version and try again

2) the namespace for btp-operator and migrator was moved from sap-btp-operator which is default in https://github.com/SAP/sap-btp-service-operator/, to kyma-system. There is one more spot that needs to be changed: https://github.com/kyma-project/kyma/blob/4c7fdfa6f5bce804e44af0be5ed5c4ecdc509c8f/tests/fast-integration/skr-svcat-migration-test/skr-svcat-migration-test.js#L115

wozniakjan commented 2 years ago

the above two are fixed; the next problem is that something is wrong with the kubeconfig inside the test pipeline

  1) SKR SVCAT migration test
       Should check if pod presets injected secrets to functions containers:
     Error: failed to execute kubectl exec svcat-auditlog-api-1-t79n4-5bdc775c95-v48x5 -c function -n default -- sh -c for v in uaa url vendor; do x="$(eval echo \$$v)"; if [[ -z "$x" ]]; then echo missing $v env variable; exit 1; else echo found $v env variable; fi; done:
,
The connection to the server localhost:8080 was refused - did you specify the right host or port?

addressed in https://github.com/kyma-project/kyma/pull/13064
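
The `localhost:8080` message is what kubectl typically prints when it finds no kubeconfig and falls back to its default server address. One way to rule that out is to pass the kubeconfig path explicitly on every call. The sketch below is an illustration only, not necessarily what the linked PR does; `kubectlExecArgs` is a hypothetical helper.

```javascript
// Hypothetical helper: builds `kubectl exec` arguments with an explicit --kubeconfig,
// so the call never silently falls back to kubectl's default server address.
function kubectlExecArgs({ kubeconfig, pod, container, namespace, script }) {
  if (!kubeconfig) {
    throw new Error('kubeconfig path is required');
  }
  return [
    '--kubeconfig', kubeconfig,
    'exec', pod,
    '-c', container,
    '-n', namespace,
    '--', 'sh', '-c', script,
  ];
}
```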

wozniakjan commented 2 years ago

light at the end of the tunnel

  1) SKR SVCAT migration test
       Should check if pod presets injected secrets to functions containers:
     Error: failed to execute kubectl exec svcat-auditlog-api-1-vj6tk-7c54fc57cc-g7f2c -c function -n default -- sh -c for v in uaa url vendor; do x="$(eval echo \$$v)"; if [[ -z "$x" ]]; then echo missing $v env variable; exit 1; else echo found $v env variable; fi; done:
missing uaa env variable,
command terminated with exit code 1
      at kubectlExecInPod (utils/index.js:693:11)
      at processTicksAndRejections (internal/process/task_queues.js:95:5)
      at async Object.checkPodPresetEnvInjected (skr-svcat-migration-test/test-helpers.js:78:9)
      at async Context.<anonymous> (skr-svcat-migration-test/skr-svcat-migration-test.js:110:5)

in kyma1.x, the pod preset containers for svcat-auditlog-api had the env variables uaa, url, and vendor set by SBUs. In kyma2.x, after the migration and cleanup, none of the three env vars is injected. After discussing with @piotrmiskiewicz and @voigt, we decided that this is expected given the current state of the implementation, but not desired. Right now the quick path forward is to put logic to https://github.com/kyma-incubator/sc-removal replacing deprecated SBUs by mounting the secrets directly in the pod preset containers, and later propagate that to sc-migration reconciler.
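
For reference, the shell loop in the failing check boils down to "report every required variable that is empty or unset". A minimal JavaScript equivalent, with the three names taken from the test output above and `missingEnvVars` as a hypothetical helper name:

```javascript
// Returns the required variable names that are unset or empty in `env`,
// mirroring the `[[ -z "$x" ]]` test from the shell loop above.
function missingEnvVars(env, required = ['uaa', 'url', 'vendor']) {
  return required.filter((name) => !env[name]);
}
```

Unlike the shell loop, which exits on the first missing variable, this returns all of them at once, which makes the failure message more useful.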

wozniakjan commented 2 years ago

Right now the quick path forward is to put logic to https://github.com/kyma-incubator/sc-removal replacing deprecated SBUs by mounting the secrets directly in the pod preset containers, and later propagate that to sc-migration reconciler.

~turns out this is not that quick of a path. Any change on the deployment is reverted by functions-controller instantly, so we can't easily add a mount to the deployment referencing the binding secret and inject env vars. Pod has those fields as immutable so we can't put it there either. The only thing we can do is a webhook on pods which is exactly how SBUs are implemented afaik at which point it might be easier just to keep the SBUs in place.~

never mind, looks like Functions have env vars as well, I will try to plug it there https://github.com/kyma-incubator/sc-removal/pull/13
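
A sketch of what plugging the values in there could look like, assuming the Function `spec.env` list accepts the usual Kubernetes EnvVar shape with `valueFrom.secretKeyRef`. This is an illustration under that assumption and not necessarily what the linked PR does; `secretEnvEntries` and the secret name in the test data are hypothetical.

```javascript
// Builds env entries that read each key from a Secret, in the Kubernetes EnvVar shape.
// Assumption: the binding secret exposes the values under keys named like the env vars.
function secretEnvEntries(secretName, keys) {
  return keys.map((key) => ({
    name: key,
    valueFrom: { secretKeyRef: { name: secretName, key } },
  }));
}
```

Because the entries live in the Function resource itself, they would not be reverted the way direct edits to the generated Deployment are.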

wozniakjan commented 2 years ago

current failures across multiple different tests are very similar

  1) SKR SVCAT migration test
       Should deprovision SKR:
     Error: the string "Error: wait timeout ..."

Could be an outage on Service Manager side

 Should cleanup platform --cascade, operator instances and bindings:
 Error: failed "smctl deprovision btp-operator-xprc -f --mode=sync": Error: request DELETE https://service-manager.cfapps.sap.hana.ondemand.com/v1/service_instances/f3fff25c-1e23-4a61-9a30-94dc5add9b4b?async=false failed: StatusCode: 502 Body: {"error":"BrokerError","description":"Failed deprovisioning request instanceID: f3fff25c-1e23-4a61-9a30-94dc5add9b4b, planID: 136d6248-1bed-45e3-912a-f553406c3ab5, serviceID: 6e6cc910-c2f7-4b95-a725-c986bb51bad7, acceptsIncomplete: true: Status: 400; ErrorMessage: \u003cnil\u003e; Description: error occurred while executing deprovision operation. Please contact Service Manager broker administrator; ResponseError: \u003cnil\u003e"

but other parts of the test pipeline are passing.

wozniakjan commented 2 years ago

another fix related to this: https://github.com/kyma-project/kyma/pull/13174, we were leaking resources, btp-operator creds couldn't be deleted as a result

wozniakjan commented 2 years ago

the last outstanding failure cause in the tests is a timeout on deprovisioning. I learned that this is due to https://github.com/kyma-incubator/reconciler/issues/647, and we shouldn't conceal the problem by increasing the deprovisioning timeout.

I will keep this issue open for now and passively monitor the resolution of https://github.com/kyma-incubator/reconciler/issues/647

wozniakjan commented 2 years ago

a brand new error started appearing this morning

  1) SKR SVCAT migration test
       Should get Runtime Status after provisioning:
     Error: kcp command failed: Error: while listing runtimes: calling https://kyma-env-broker.cp.dev.kyma.cloud.sap/runtimes?instance_id=0353ae55-e9ee-498c-a80d-80c2c0804c8b&op_detail=all&page=1&page_size=100 returned 401 (401 Unauthorized) status
      at KCPWrapper.exec (kcp/client.js:268:13)
      at processTicksAndRejections (internal/process/task_queues.js:95:5)
      at async KCPWrapper.runtimes (kcp/client.js:100:20)
      at async KCPWrapper.getRuntimeStatusOperations (kcp/client.js:161:27)
      at async Context.<anonymous> (skr-svcat-migration-test/skr-svcat-migration-test.js:74:27)

but looks like this is the case for many other tests. The last successful execution without this error among all fast-integration tests was yesterday 9pm.

wozniakjan commented 2 years ago

but looks like this is the case for many other tests. The last successful execution without this error among all fast-integration tests was yesterday 9pm.

it was a configuration error, fixed by the SRE now

and the pipeline is green