Closed pbochynski closed 2 years ago
@PK85 asked me to try locally and indeed, smctl fails to log in with the credentials:
$ smctl login -a "$(jq --raw-output '.sm_url' creds.json)" --param subdomain=e2etestingscmigration --auth-flow client-credentials --client-id "$(jq --raw-output '.clientid' creds.json)" --client-secret "$(jq --raw-output '.clientsecret' creds.json)" -v
DEBU[0000] Sending request GET https://service-manager.cfapps.sap.hana.ondemand.com/v1/info?subdomain=e2etestingscmigration component="smclient/client.go:509" correlation_id=-
DEBU[0000] authenticator: oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"unauthorized","error_description":"Client Authentication failed."} component="oidc/oidc.go:74" correlation_id=-
DEBU[0000] oidc error: oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"unauthorized","error_description":"Client Authentication failed."} component="oidc/oidc.go:108" correlation_id=-
Error: could not login
Reason: auth error: Client Authentication failed.
maybe something in SM has changed and they stopped supporting this auth workflow?
hmm, looks like bindings for the SM instance got removed, I guess we will need new binding then..
getting 401 even with new creds, contacted @michal-keidar for ideas to address this
I recommend that you open an NGPBUG ticket to the dev team (component “Service Manager”). They have made some changes recently due to Crown Jewels, so they will be able to investigate and assist.
here is the ticket https://jtrack.wdf.sap.corp/browse/NGPBUG-172313
Status: ticket created for the SM team, still no one assigned on their side. Bumped NGPBUG ticket priority to critical.
Status: we resolved the issue with SM Instance binding. Now the test pipeline is failing because of random behavior of sm-proxy. In the test we are trying to provision 3 Service Instances from SM, but not all Service Brokers are present in the cluster. Depending on the test run, we see 1 or 2 random Service Brokers from SM, but not all required 3 and that's what we are trying to resolve now.
Right now this is a blocker for that pipeline: https://github.com/kyma-project/kyma/issues/12945
One of the root issues was that service-manager-proxy, installed concurrently by the reconciler, was trying to create ClusterServiceBrokers when the service-catalog components were not ready yet:
time="2021-12-30T09:45:54Z" level=error msg="Internal error occurred: failed calling webhook \"mutating.clusterservicebrokers.servicecatalog.k8s.io\": Post \"https://service-catalog-catalog-webhook.kyma-system.svc:443/mutating-clusterservicebrokers?timeout=30s\": no endpoints available for service \"service-catalog-catalog-webhook\"" component="reconcile/task_scheduler.go:43" correlation_id=0293c363-71ef-4222-a4ef-5faf97ad4bd3
This one was resolved in: https://github.com/kyma-project/kyma/pull/12959
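For illustration only: the race above (creating ClusterServiceBrokers before the service-catalog webhook has endpoints) is the kind of problem a readiness wait guards against. The Python sketch below polls a readiness check until it passes or a deadline expires. It is not the code from the PR; `wait_until` and its parameters are hypothetical names, and a real check would query the cluster instead of calling a local function.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Hypothetical helper: is_ready stands in for "the webhook service
    has endpoints"; sleep is injectable so the loop can be tested quickly.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False  # give up instead of blocking forever
        sleep(interval_s)
```

The check always runs at least once, so a zero timeout still reports a dependency that is already ready.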
Last issue for now is the sc-removal job being unable to remove finalizers from UsageKind. It was working fine with Kyma 1.24.8. The job fails with the error:
panic: Operation cannot be fulfilled on usagekinds.servicecatalog.kyma-project.io "serverless-function": the object has been modified; please apply your changes to the latest version and try again
We think that might be caused by the reconciler reverting the state of the CR because of the annotation reconciler.kyma-project.io/managed-by-reconciler-disclaimer:
apiVersion: servicecatalog.kyma-project.io/v1alpha1
kind: UsageKind
metadata:
  annotations:
    reconciler.kyma-project.io/managed-by-reconciler-disclaimer: |-
      DO NOT EDIT - This resource is managed by Kyma.
      Any modifications are discarded and the resource is reverted to the original state.
  creationTimestamp: "2021-12-30T15:28:48Z"
  finalizers:
  - servicecatalog.kyma-project.io/usage-kind-protection
  generation: 1
  labels:
    reconciler.kyma-project.io/managed-by: reconciler
    reconciler.kyma-project.io/origin-version: PR-12959
  name: serverless-function
  resourceVersion: "10816"
  uid: 07516338-e30f-4c3f-b69c-c56ff5909db8
spec:
  displayName: Function
  labelsPath: spec.labels
  resource:
    group: serverless.kyma-project.io
    kind: function
    version: v1alpha1
This seems to be quite random, because manual intervention during the pipeline execution (deleting migration job Pod) sometimes helps and the job manages to remove finalizers from UsageKind.
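The panic message is the Kubernetes optimistic-concurrency conflict: the update was based on a resourceVersion that another writer had already replaced. The usual client-side answer is to re-read the object and retry. Below is a minimal Python sketch of that loop against an in-memory stand-in; `FakeStore`, `Conflict` and `remove_finalizer` are hypothetical names, the integer version is a simplification of the real string resourceVersion, and none of this is the sc-removal implementation.

```python
class Conflict(Exception):
    """Stand-in for the conflict error returned on a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for the API server holding one object."""

    def __init__(self, finalizers, interfering_writes=0):
        self._finalizers = list(finalizers)
        self._version = 1
        # Number of upcoming updates that lose the race to another writer.
        self._interfering_writes = interfering_writes

    def get(self):
        return {"resourceVersion": self._version, "finalizers": list(self._finalizers)}

    def update(self, obj):
        if self._interfering_writes > 0:
            # Simulate another writer (e.g. a controller) getting in first.
            self._interfering_writes -= 1
            self._version += 1
        if obj["resourceVersion"] != self._version:
            raise Conflict("the object has been modified")
        self._finalizers = list(obj["finalizers"])
        self._version += 1


def remove_finalizer(store, finalizer, attempts=5):
    """Re-read the object and retry the update on every conflict."""
    for _ in range(attempts):
        obj = store.get()  # always start from the latest version
        obj["finalizers"] = [f for f in obj["finalizers"] if f != finalizer]
        try:
            store.update(obj)
            return True
        except Conflict:
            continue  # stale read: fetch again and retry
    return False
```

A retry only covers the stale-version conflict. If something keeps reverting the object, as suspected above, the finalizer could reappear even after a successful update.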
Possible workaround to verify:
Resolving the sc-removal issues will require additional work with a custom reconciler, but it does not block the Kyma 2.0 release, only the service management migration (cc: @wozniakjan)
Problems with ClusterServiceBrokers being unavailable on the clusters in tests were mitigated by adding initContainer to service-manager-proxy and will be cherry-picked here:
Regarding those, I am removing the release-blocker label from this one.
The pipeline is still red - waiting for update operation.
currently, the failure is related to the btp-operator self-signed cert. There was a request to move btp-operator to a different namespace, which was done in https://github.tools.sap/kyma/management-plane-config/pull/1304, but the self-signed cert was not adjusted.
binding 'func-sb-svcat-html5-apps-repo-1' in namespace 'default' failed: 'Internal error occurred: failed calling webhook "mservicebinding.kb.io": Post "https://sap-btp-operator-webhook-service.kyma-system.svc:443/mutate-services-cloud-sap-com-v1alpha1-servicebinding?timeout=10s": x509: certificate is valid for sap-btp-operator-webhook-service.sap-btp-operator.svc, sap-btp-operator-webhook-service.sap-btp-operator.svc.cluster.local, not sap-btp-operator-webhook-service.kyma-system.svc'
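The x509 error follows from how Service DNS names are built: the namespace is part of the name, so a certificate issued for the old namespace no longer covers the host the API server calls after the move. A small Python sketch of that comparison, using the names from the error message; `service_dns_names` and `cert_covers` are hypothetical helpers, the `cluster.local` suffix is taken from the SANs shown in the error, and the check is a plain exact match rather than full certificate verification.

```python
def service_dns_names(service, namespace, cluster_domain="cluster.local"):
    """DNS names a webhook certificate would need for a Service in a namespace."""
    short = f"{service}.{namespace}.svc"
    return [short, f"{short}.{cluster_domain}"]


def cert_covers(host, sans):
    """Exact, case-insensitive match of host against the certificate's DNS SANs."""
    return host.lower() in {san.lower() for san in sans}


svc = "sap-btp-operator-webhook-service"
old_sans = service_dns_names(svc, "sap-btp-operator")  # what the cert was issued for
called_host = f"{svc}.kyma-system.svc"                  # what the API server calls now

print(cert_covers(called_host, old_sans))                               # False
print(cert_covers(called_host, service_dns_names(svc, "kyma-system")))  # True
```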
fwiw, the sc-removal chart was supposed to be deprecated in favor of the sc-migration reconciler, but constant shifts in priorities slowed down that development. I don't think it is worth the effort at the moment to address all the other issues with the sc-removal chart; we should rather finish the sc-migration reconciler task.
The sc-migration reconciler can be tracked here: https://github.com/kyma-incubator/reconciler/pull/389
moving forward with two more findings regarding the pipeline:
1) one already reported in https://github.com/kyma-project/kyma/issues/12843#issuecomment-1003089390
panic: Operation cannot be fulfilled on usagekinds.servicecatalog.kyma-project.io "serverless-function": the object has been modified; please apply your changes to the latest version and try again
2) the namespace for btp-operator and migrator was moved from sap-btp-operator, which is the default in https://github.com/SAP/sap-btp-service-operator/, to kyma-system. There is one more spot that needs to be changed:
https://github.com/kyma-project/kyma/blob/4c7fdfa6f5bce804e44af0be5ed5c4ecdc509c8f/tests/fast-integration/skr-svcat-migration-test/skr-svcat-migration-test.js#L115
the above two are fixed; the next problem is that something is wrong with the kubeconfig inside the test pipeline
1) SKR SVCAT migration test
Should check if pod presets injected secrets to functions containers:
Error: failed to execute kubectl exec svcat-auditlog-api-1-t79n4-5bdc775c95-v48x5 -c function -n default -- sh -c for v in uaa url vendor; do x="$(eval echo \$$v)"; if [[ -z "$x" ]]; then echo missing $v env variable; exit 1; else echo found $v env variable; fi; done:
,
The connection to the server localhost:8080 was refused - did you specify the right host or port?
addressed in https://github.com/kyma-project/kyma/pull/13064
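The localhost:8080 message is what kubectl prints when it finds no usable cluster configuration and falls back to its default address. Apart from the `--kubeconfig` flag, kubectl reads the `KUBECONFIG` environment variable (a path list) and otherwise `$HOME/.kube/config`. The Python sketch below mirrors that lookup so a test step could fail early with a clearer message; the function names are hypothetical and this is not what the linked PR changes.

```python
import os
from pathlib import Path


def kubeconfig_candidates(env, home):
    """Paths kubectl would consider, ignoring the --kubeconfig flag."""
    raw = env.get("KUBECONFIG", "")
    if raw:
        # KUBECONFIG may hold several paths separated by the OS path separator.
        return [Path(p) for p in raw.split(os.pathsep) if p]
    return [Path(home) / ".kube" / "config"]


def usable_kubeconfigs(env, home):
    """Candidates that actually exist as files on disk."""
    return [p for p in kubeconfig_candidates(env, home) if p.is_file()]


if __name__ == "__main__":
    found = usable_kubeconfigs(os.environ, Path.home())
    if not found:
        print("no kubeconfig found; kubectl would fall back to localhost:8080")
```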
light at the end of the tunnel
1) SKR SVCAT migration test
Should check if pod presets injected secrets to functions containers:
Error: failed to execute kubectl exec svcat-auditlog-api-1-vj6tk-7c54fc57cc-g7f2c -c function -n default -- sh -c for v in uaa url vendor; do x="$(eval echo \$$v)"; if [[ -z "$x" ]]; then echo missing $v env variable; exit 1; else echo found $v env variable; fi; done:
missing uaa env variable,
command terminated with exit code 1
at kubectlExecInPod (utils/index.js:693:11)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async Object.checkPodPresetEnvInjected (skr-svcat-migration-test/test-helpers.js:78:9)
at async Context.<anonymous> (skr-svcat-migration-test/skr-svcat-migration-test.js:110:5)
in kyma1.x, the pod preset containers for svcat-auditlog-api had the env variables uaa, url and vendor set by SBUs. But in kyma2.x, after the migration and cleanup, none of the three env vars is injected. After discussing with @piotrmiskiewicz and @voigt, we concluded that this is expected given the current state of the implementation, but not desired. Right now the quick path forward is to add logic to https://github.com/kyma-incubator/sc-removal that replaces the deprecated SBUs by mounting the secrets directly in the pod preset containers, and later propagate that to the sc-migration reconciler.
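The failing step runs a small shell loop inside the function container that checks that uaa, url and vendor are non-empty. A rough Python equivalent of that check is sketched below; `missing_env_vars` is a hypothetical name, and unlike the shell loop, which exits at the first missing variable, it reports all of them.

```python
import os


def missing_env_vars(env, required=("uaa", "url", "vendor")):
    """Return the required variables that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    for name in missing:
        print(f"missing {name} env variable")
    raise SystemExit(1 if missing else 0)
```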
~turns out this is not that quick of a path. Any change on the deployment is reverted by functions-controller instantly, so we can't easily add a mount to the deployment referencing the binding secret and inject env vars. Pod has those fields as immutable so we can't put it there either. The only thing we can do is a webhook on pods, which is exactly how SBUs are implemented afaik, at which point it might be easier just to keep the SBUs in place.~
never mind, looks like Functions have env vars as well, I will try to plug it there https://github.com/kyma-incubator/sc-removal/pull/13
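As a rough idea of what plugging the binding secret into a Function's env vars could look like, the Python sketch below builds env entries in the standard Kubernetes EnvVar shape (`valueFrom.secretKeyRef`). This assumes the Function spec accepts that shape; the helper name and the secret name are hypothetical, and the linked PR may do this differently.

```python
def env_from_secret(secret_name, keys):
    """Build EnvVar-style entries that read each key from a Secret."""
    return [
        {
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
        }
        for key in keys
    ]


# Hypothetical secret name; the keys are the env vars the test expects.
for entry in env_from_secret("my-binding-secret", ["uaa", "url", "vendor"]):
    print(entry)
```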
current failures are very similar across multiple different tests
1) SKR SVCAT migration test
Should deprovision SKR:
Error: the string "Error: wait timeout ..."
Could be an outage on Service Manager side
Should cleanup platform --cascade, operator instances and bindings:
Error: failed "smctl deprovision btp-operator-xprc -f --mode=sync": Error: request DELETE https://service-manager.cfapps.sap.hana.ondemand.com/v1/service_instances/f3fff25c-1e23-4a61-9a30-94dc5add9b4b?async=false failed: StatusCode: 502 Body: {"error":"BrokerError","description":"Failed deprovisioning request instanceID: f3fff25c-1e23-4a61-9a30-94dc5add9b4b, planID: 136d6248-1bed-45e3-912a-f553406c3ab5, serviceID: 6e6cc910-c2f7-4b95-a725-c986bb51bad7, acceptsIncomplete: true: Status: 400; ErrorMessage: \u003cnil\u003e; Description: error occurred while executing deprovision operation. Please contact Service Manager broker administrator; ResponseError: \u003cnil\u003e"
but other parts of the test pipeline are passing.
another fix related to this: https://github.com/kyma-project/kyma/pull/13174, we were leaking resources, btp-operator creds couldn't be deleted as a result
the last outstanding cause of failure in the tests is a timeout on deprovisioning. I learned that this is due to https://github.com/kyma-incubator/reconciler/issues/647 and that we shouldn't conceal the problem by increasing the deprovisioning timeout.
I will keep this issue open for now and passively monitor the resolution of https://github.com/kyma-incubator/reconciler/issues/647
a brand new error started appearing this morning
1) SKR SVCAT migration test
Should get Runtime Status after provisioning:
Error: kcp command failed: Error: while listing runtimes: calling https://kyma-env-broker.cp.dev.kyma.cloud.sap/runtimes?instance_id=0353ae55-e9ee-498c-a80d-80c2c0804c8b&op_detail=all&page=1&page_size=100 returned 401 (401 Unauthorized) status
at KCPWrapper.exec (kcp/client.js:268:13)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async KCPWrapper.runtimes (kcp/client.js:100:20)
at async KCPWrapper.getRuntimeStatusOperations (kcp/client.js:161:27)
at async Context.<anonymous> (skr-svcat-migration-test/skr-svcat-migration-test.js:74:27)
but looks like this is the case for many other tests. The last successful execution without this error among all fast-integration tests was yesterday 9pm.
it was a configuration error, fixed by the SRE now
and the pipeline is green
Description
The test fails on smctl login:
Job history: https://status.build.kyma-project.io/job-history/gs/kyma-prow-logs/logs/skr-aws-svcat-migration-dev
Sample failed test log: https://storage.googleapis.com/kyma-prow-logs/logs/skr-aws-svcat-migration-dev/1470302568000262144/build-log.txt