It looks like the serverless component reconciler is not responding; all other components are reconciled:
The plan still fails. From very verbose logs I can see that the serverless component still cannot be installed, even though issue #324 was resolved. @m00g3n @varbanv - can you verify the current problem?
It should be fixed once https://github.com/kyma-project/test-infra/pull/4406 is merged. I'll monitor the status.
Looks like the nightly cluster install fails with:
/home/prow/go/src/github.com/kyma-project/test-infra/prow/scripts/cluster-integration/helpers/reconciler.sh: line 13: cd: /home/prow/go/src/github.com/kyma-project/control-plane/tools/reconciler: No such file or directory
Failed to change dir to: /home/prow/go/src/github.com/kyma-project/control-plane/tools/reconciler
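The usual Prow-side fix for this is to make sure the missing repository is checked out for the job (e.g. via extra_refs). Purely as an illustration, a script-side guard could look like the sketch below; the clone fallback and variable name are my assumptions, not what reconciler.sh actually does.

```bash
# Sketch only: ensure the control-plane checkout exists before cd'ing into it.
RECONCILER_DIR="/home/prow/go/src/github.com/kyma-project/control-plane/tools/reconciler"
if [[ ! -d "${RECONCILER_DIR}" ]]; then
  # Assumption: fall back to a shallow clone when the job did not provide the repo.
  git clone --depth=1 https://github.com/kyma-project/control-plane.git \
    "/home/prow/go/src/github.com/kyma-project/control-plane"
fi
cd "${RECONCILER_DIR}" || { echo "Failed to change dir to: ${RECONCILER_DIR}" >&2; exit 1; }
```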
Looks like the nightly tests are green again.
And something went wrong again:
2021/11/10 09:02:13 UTC [INFO] Kyma version to reconcile: main
sed: can't read ./e2e-test/template.json: No such file or directory
Currently failing because of the Jobs and StatefulSets reconciliation issue: https://github.com/kyma-incubator/reconciler/issues/420
Recent failures:
Problem with the connection to PostgreSQL:
{"level":"DEBUG","time":"2021-11-22T03:26:17.133Z","message":"Worker pool is checking for processable operations (max parallel ops per cluster: 25)"}
{"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Bookkeeper failed to retrieve currently running reconciliations: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"ERROR","time":"2021-11-22T03:26:18.146Z","message":"Inventory watchers failed to fetch clusters to reconcile from inventory (using reconcile interval of 300 secs): failed to retrieve cluster-status-idents: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"WARN","time":"2021-11-22T03:26:18.146Z","message":"Worker pool failed to retrieve processable operations: dial tcp 100.66.23.99:5432: connect: connection refused"}
{"level":"WARN","time":"2021-11-22T03:26:18.147Z","message":"Worker pool failed to invoke all processable operations but will retry after 30.0 seconds again"}
Some relations don't exist:
{"level":"ERROR","time":"2021-11-22T03:28:47.127Z","message":"Postgres Query() error: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"ERROR","time":"2021-11-22T03:28:47.127Z","message":"Bookkeeper failed to retrieve currently running reconciliations: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"DEBUG","time":"2021-11-22T03:28:47.133Z","message":"Worker pool is checking for processable operations (max parallel ops per cluster: 25)"}
{"level":"ERROR","time":"2021-11-22T03:28:47.134Z","message":"Postgres Query() error: pq: relation \"scheduler_operations\" does not exist"}
{"level":"WARN","time":"2021-11-22T03:28:47.134Z","message":"Worker pool failed to retrieve processable operations: pq: relation \"scheduler_operations\" does not exist"}
{"level":"WARN","time":"2021-11-22T03:28:47.134Z","message":"Worker pool failed to invoke all processable operations but will retry after 30.0 seconds again"}
{"level":"ERROR","time":"2021-11-22T03:29:17.127Z","message":"Postgres Query() error: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"ERROR","time":"2021-11-22T03:29:17.127Z","message":"Bookkeeper failed to retrieve currently running reconciliations: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"ERROR","time":"2021-11-22T03:29:17.133Z","message":"Postgres Query() error: pq: relation \"inventory_cluster_config_statuses\" does not exist"}
{"level":"ERROR","time":"2021-11-22T03:29:17.133Z","message":"Inventory watchers failed to fetch clusters to reconcile from inventory (using reconcile interval of 300 secs): failed to retrieve cluster-status-idents: pq: relation \"inventory_cluster_config_statuses\" does not exist"}
Since we are using the same name for the cluster, we're hitting the certificate request limit:
message: 'obtaining certificate failed: acme: error: 429 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order
:: urn:ietf:params:acme:error:rateLimited :: Error creating new order :: too
many certificates (5) already issued for this exact set of domains in the last
168 hours: *.rec-night.kyma-prow.shoot.canary.k8s-hana.ondemand.com: see https://letsencrypt.org/docs/rate-limits/'
That's why most of this job's executions are failing. I've already talked to the goats and it seems like there is no valid solution to this. So, I'm removing the nightly cluster logic here and instead the job will create a new cluster with a random name every time.
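The point of the random name is simply that every run requests a certificate for a different set of domains, so the "duplicate certificate" limit (5 per exact domain set per 168 hours) no longer applies. A minimal sketch of the naming, with variable names that are mine rather than the job's:

```bash
# Sketch: derive a short, unique cluster name per execution instead of the
# fixed "rec-night" name whose wildcard domain kept hitting the rate limit.
readonly NAME_PREFIX="rec"
readonly SUFFIX="$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c 6)"
readonly CLUSTER_NAME="${NAME_PREFIX}-${SUFFIX}"
echo "Provisioning Gardener cluster: ${CLUSTER_NAME}"
```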
The new job looks like this: https://status.build.kyma-project.io/view/gs/kyma-prow-logs/logs/suleymanakbas91_test_of_prowjob_nightly-main-reconciler-e2e/1463157597619097600
It creates a Gardener cluster, deploys the latest Reconciler, reconciles Kyma, runs fast-integration tests, breaks Kyma by deleting the Deployments from the kyma-system namespace, reconciles Kyma again, runs the fast-integration tests once more, and finally deletes the cluster.
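The "break Kyma" step described above boils down to deleting the workloads the reconciler is expected to restore; assuming it targets the Deployments in kyma-system literally, it is essentially:

```bash
# Remove every Deployment from kyma-system, then trigger reconciliation again
# and expect fast-integration to pass once the reconciler has recreated them.
kubectl delete deployments --all -n kyma-system
```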
I'm keeping my PR on hold, as the nightly job has stabilized again.
After some executions passed, today we have a new error:
sending reconciliation request to mothership-reconciler at: http://reconciler-mothership-reconciler.reconciler/v1/clusters
reconciliationResponse: {"cluster":"6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5","clusterVersion":1,"configurationVersion":1,"failures":null,"status":"reconcile_pending","statusURL":"http://reconciler-mothership-reconciler:80/v1/clusters/6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5/configs/1/status"}
RECONCILE_STATUS_URL: http://reconciler-mothership-reconciler:80/v1/clusters/6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5/configs/1/status
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconciling ....
...
Waiting for reconciliation to finish, current status: reconciling ....
Waiting for reconciliation to finish, current status: reconciling ....
Failed to reconcile Kyma. Exiting
command terminated with exit code 1
2021/11/25 07:11:06 UTC [ERROR] Failed to reconcile Kyma
error: a container name must be specified for pod mothership-reconciler-7fc7d95dfd-4qmdd, choose one of: [mothership-reconciler fluentbit-sidecar]
The latest issue will be fixed in https://github.com/kyma-project/test-infra/pull/4502
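For reference, the error above only means the pod runs two containers, so kubectl needs the container named explicitly when dumping logs. Presumably the fix amounts to something like this (namespace is an assumption):

```bash
# Name the container instead of letting kubectl guess between
# mothership-reconciler and fluentbit-sidecar.
kubectl logs -n reconciler deployment/mothership-reconciler -c mothership-reconciler
```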
The logging problem is fixed. Now the real problem underneath is visible:
to cluster status 'error': Done=cluster-essentials,CRDs,certificates / Error=istio-configuration
[error: All attempts fail:\n#1: Could not update Istio: Error occurred when calling istioctl: exit status 1\n#2:
Could not update Istio: Error occurred when calling istioctl: exit status 1\n#3: Could not update Istio:
Error occurred when calling istioctl: exit status 1\n#4: Could not update Istio: Error occurred when calling istioctl:
exit status 1\n#5: Could not update Istio: Error occurred when calling istioctl: exit status 1] /
Other=rafter,cluster-users,serverless,tracing,kiali,ory,api-gateway,service-catalog,logging,eventing,service-catalog-addons,monitoring,helm-broker,application-connector"}
This might be related to this change. @dariusztutaj
Description: Job fails due to some database error (missing relation):
https://status.build.kyma-project.io/job-history/gs/kyma-prow-logs/logs/nightly-main-reconciler-e2e