kyma-incubator / reconciler

Kyma reconciler

nightly-main-reconciler-e2e job fails #322

Closed · pbochynski closed this 2 years ago

pbochynski commented 3 years ago

Description

The job fails due to a database error (missing relations):

{"level":"ERROR","time":"2021-10-28T09:02:40.612Z","message":"Postgres Query() error: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"ERROR","time":"2021-10-28T09:02:40.612Z","message":"Bookkeeper failed to retrieve currently running reconciliations: pq: relation \"scheduler_reconciliations\" does not exist"}
{"level":"DEBUG","time":"2021-10-28T09:02:40.631Z","message":"Worker pool is checking for processable operations (max parallel ops per cluster: 25)"}
{"level":"DEBUG","time":"2021-10-28T09:02:40.632Z","message":"Postgres Query(): SELECT priority, scheduling_id, correlation_id, runtime_id, cluster_config, component, state, reason, created, updated FROM scheduler_operations WHERE scheduling_id IN (SELECT scheduling_id FROM scheduler_reconciliations WHERE finished=$1) | [false]"}
{"level":"ERROR","time":"2021-10-28T09:02:40.633Z","message":"Postgres Query() error: pq: relation \"scheduler_operations\" does not exist"}
{"level":"WARN","time":"2021-10-28T09:02:40.633Z","message":"Worker pool failed to retrieve processable operations: pq: relation \"scheduler_operations\" does not exist"}
{"level":"WARN","time":"2021-10-28T09:02:40.633Z","message":"Worker pool failed to invoke all processable operations but will retry after 30.0 seconds again"}
{"level":"DEBUG","time":"2021-10-28T09:03:10.610Z","message":"Postgres Query(): SELECT lock, runtime_id, cluster_config, cluster_config_status, finished, scheduling_id, created, updated FROM scheduler_reconciliations WHERE finished=$1 | [false]"}

https://status.build.kyma-project.io/job-history/gs/kyma-prow-logs/logs/nightly-main-reconciler-e2e
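
The "pq: relation ... does not exist" errors suggest the schema migration never ran against this database. A minimal diagnostic sketch in Go, assuming a reachable Postgres instance and the lib/pq driver (which the "pq:" error prefix points to); the DSN is a placeholder:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; adjust for the actual reconciler database.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/reconciler?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Check each table the logs complain about.
	for _, table := range []string{"scheduler_reconciliations", "scheduler_operations"} {
		var exists bool
		err := db.QueryRow(
			`SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_name = $1)`,
			table).Scan(&exists)
		if err != nil {
			log.Fatalf("query failed: %v", err)
		}
		fmt.Printf("%s exists: %v\n", table, exists)
	}
}
```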

varbanv commented 3 years ago

It looks like the serverless component reconciler is not responding; all other components are reconciled:


{"level":"WARN","time":"2021-10-28T17:14:54.143Z","message":"Remote invoker failed to send HTTP request to component reconciler 'http://serverless-reconciler:8080/v1/run': Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused"}
{"level":"DEBUG","time":"2021-10-28T17:14:58.038Z","message":"Postgres QueryRow(): SELECT version, runtime_id, cluster_version, kyma_version, kyma_profile, components, administrators, contract, deleted, created FROM inventory_cluster_configs WHERE runtime_id=$1 AND version=$2 | [6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5 1]"}
{"level":"DEBUG","time":"2021-10-28T17:14:58.040Z","message":"Postgres QueryRow(): SELECT id, runtime_id, cluster_version, config_version, status, deleted, created FROM inventory_cluster_config_statuses WHERE config_version=$1 ORDER BY id desc | [1]"}
{"level":"DEBUG","time":"2021-10-28T17:14:58.041Z","message":"Postgres QueryRow(): SELECT version, runtime_id, runtime, metadata, kubeconfig, contract, deleted, created FROM inventory_clusters WHERE deleted=$1 AND version=$2 | [false 1]"}
{"level":"DEBUG","time":"2021-10-28T17:15:04.195Z","message":"Worker calls invoker for operation 'OperationEntity [SchedulingID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--afdda14d-5eef-479e-8f99-3e4358ae0cd8,CorrelationID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--1104992f-6a13-4da2-9fb9-62ca8553f056,RuntimeID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5,ClusterConfig=1,Component=serverless,Prio=5]' (in retryable function)"}
{"level":"DEBUG","time":"2021-10-28T17:15:04.195Z","message":"Remote invoker found dedicated reconciler for component 'serverless'"}
{"level":"DEBUG","time":"2021-10-28T17:15:04.195Z","message":"Remote invoker is calling remote reconciler via HTTP (URL: http://serverless-reconciler:8080/v1/run) for component 'serverless' (schedulingID:6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--afdda14d-5eef-479e-8f99-3e4358ae0cd8/correlationID:6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--1104992f-6a13-4da2-9fb9-62ca8553f056)"}
{"level":"WARN","time":"2021-10-28T17:15:05.211Z","message":"Remote invoker failed to send HTTP request to component reconciler 'http://serverless-reconciler:8080/v1/run': Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused"}
{"level":"WARN","time":"2021-10-28T17:15:05.211Z","message":"Worker stops processing operation 'OperationEntity [SchedulingID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--afdda14d-5eef-479e-8f99-3e4358ae0cd8,CorrelationID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--1104992f-6a13-4da2-9fb9-62ca8553f056,RuntimeID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5,ClusterConfig=1,Component=serverless,Prio=5]' because invoker returned consistently errors (2 retries): All attempts fail:\n#1: failed to call remote reconciler (URL: http://serverless-reconciler:8080/v1/run): Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused\n#2: failed to call remote reconciler (URL: http://serverless-reconciler:8080/v1/run): Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused"}
{"level":"WARN","time":"2021-10-28T17:15:05.211Z","message":"Worker pool received an error from worker assigned to operation 'OperationEntity [SchedulingID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--afdda14d-5eef-479e-8f99-3e4358ae0cd8,CorrelationID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5--1104992f-6a13-4da2-9fb9-62ca8553f056,RuntimeID=6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5,ClusterConfig=1,Component=serverless,Prio=5]': All attempts fail:\n#1: failed to call remote reconciler (URL: http://serverless-reconciler:8080/v1/run): Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused\n#2: failed to call remote reconciler (URL: http://serverless-reconciler:8080/v1/run): Post \"http://serverless-reconciler:8080/v1/run\": dial tcp 100.68.126.245:8080: connect: connection refused"}
{"level":"DEBUG","time":"2021-10-28T17:15:13.087Z","message":"Postgres QueryRow(): SELECT version, runtime_id, cluster_version, kyma_version, kyma_profile, components, administrators, contract, deleted, created FROM inventory_cluster_configs WHERE runtime_id=$1 AND version=$2 | [6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5 1]"}```
varbanv commented 3 years ago

Caused by: https://github.com/kyma-incubator/reconciler/issues/324

pbochynski commented 3 years ago

The pipeline still fails. From the very verbose logs I can see that the serverless component still cannot be installed, even though issue #324 was resolved. @m00g3n @varbanv - can you verify the current problem?

suleymanakbas91 commented 3 years ago

It should be fixed once https://github.com/kyma-project/test-infra/pull/4406 is merged. I'll monitor the status.

varbanv commented 3 years ago

Looks like the nightly cluster install fails with:

/home/prow/go/src/github.com/kyma-project/test-infra/prow/scripts/cluster-integration/helpers/reconciler.sh: line 13: cd: /home/prow/go/src/github.com/kyma-project/control-plane/tools/reconciler: No such file or directory
Failed to change dir to: /home/prow/go/src/github.com/kyma-project/control-plane/tools/reconciler
varbanv commented 3 years ago

Looks like the nightly tests are green again.

pbochynski commented 3 years ago

And something went wrong again:

2021/11/10 09:02:13 UTC [INFO] Kyma version to reconcile: main
sed: can't read ./e2e-test/template.json: No such file or directory

Logs: https://storage.googleapis.com/kyma-prow-logs/logs/nightly-main-reconciler-e2e/1458367586910605312/build-log.txt

varbanv commented 3 years ago

Currently failing because of the jobs and statefulsets reconciliation issue: https://github.com/kyma-incubator/reconciler/issues/420

pbochynski commented 2 years ago

Recent failures:

  1. Problem with connecting to PostgreSQL (a readiness-check sketch follows this list):

    {"level":"DEBUG","time":"2021-11-22T03:26:17.133Z","message":"Worker pool is checking for processable operations (max parallel ops per cluster: 25)"}
    {"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Bookkeeper failed to retrieve currently running reconciliations: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"ERROR","time":"2021-11-22T03:26:18.146Z","message":"Inventory watchers failed to fetch clusters to reconcile from inventory (using reconcile interval of 300 secs): failed to retrieve cluster-status-idents: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"ERROR","time":"2021-11-22T03:26:18.145Z","message":"Postgres Query() error: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"WARN","time":"2021-11-22T03:26:18.146Z","message":"Worker pool failed to retrieve processable operations: dial tcp 100.66.23.99:5432: connect: connection refused"}
    {"level":"WARN","time":"2021-11-22T03:26:18.147Z","message":"Worker pool failed to invoke all processable operations but will retry after 30.0 seconds again"}
  2. Some relations don't exist:

    {"level":"ERROR","time":"2021-11-22T03:28:47.127Z","message":"Postgres Query() error: pq: relation \"scheduler_reconciliations\" does not exist"}
    {"level":"ERROR","time":"2021-11-22T03:28:47.127Z","message":"Bookkeeper failed to retrieve currently running reconciliations: pq: relation \"scheduler_reconciliations\" does not exist"}
    {"level":"DEBUG","time":"2021-11-22T03:28:47.133Z","message":"Worker pool is checking for processable operations (max parallel ops per cluster: 25)"}
    {"level":"ERROR","time":"2021-11-22T03:28:47.134Z","message":"Postgres Query() error: pq: relation \"scheduler_operations\" does not exist"}
    {"level":"WARN","time":"2021-11-22T03:28:47.134Z","message":"Worker pool failed to retrieve processable operations: pq: relation \"scheduler_operations\" does not exist"}
    {"level":"WARN","time":"2021-11-22T03:28:47.134Z","message":"Worker pool failed to invoke all processable operations but will retry after 30.0 seconds again"}
    {"level":"ERROR","time":"2021-11-22T03:29:17.127Z","message":"Postgres Query() error: pq: relation \"scheduler_reconciliations\" does not exist"}
    {"level":"ERROR","time":"2021-11-22T03:29:17.127Z","message":"Bookkeeper failed to retrieve currently running reconciliations: pq: relation \"scheduler_reconciliations\" does not exist"}
    {"level":"ERROR","time":"2021-11-22T03:29:17.133Z","message":"Postgres Query() error: pq: relation \"inventory_cluster_config_statuses\" does not exist"}
    {"level":"ERROR","time":"2021-11-22T03:29:17.133Z","message":"Inventory watchers failed to fetch clusters to reconcile from inventory (using reconcile interval of 300 secs): failed to retrieve cluster-status-idents: pq: relation \"inventory_cluster_config_statuses\" does not exist"}
suleymanakbas91 commented 2 years ago

Since we are using the same name for the cluster, we're hitting the Let's Encrypt certificate rate limit:

 message: 'obtaining certificate failed: acme: error: 429 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order
      :: urn:ietf:params:acme:error:rateLimited :: Error creating new order :: too
      many certificates (5) already issued for this exact set of domains in the last
      168 hours: *.rec-night.kyma-prow.shoot.canary.k8s-hana.ondemand.com: see https://letsencrypt.org/docs/rate-limits/'

That's why most of this job's executions are failing. I've already talked to the goats, and it seems there is no valid workaround for this. So I'm removing the nightly-cluster logic here; instead, the job will create a new cluster with a random name every time (see the sketch below).
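
A minimal sketch of that random-name approach: derive a fresh cluster name per run so Let's Encrypt never sees the same exact set of domains twice. The prefix and suffix length here are assumptions, not the actual job's values:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomClusterName appends a random hex suffix to a fixed prefix,
// yielding a unique cluster (and thus certificate) domain per run.
func randomClusterName(prefix string) string {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(b))
}

func main() {
	// e.g. "rec-e2e-9f86d081" instead of the fixed "rec-night" name
	fmt.Println(randomClusterName("rec-e2e"))
}
```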

suleymanakbas91 commented 2 years ago

The new job looks like this: https://status.build.kyma-project.io/view/gs/kyma-prow-logs/logs/suleymanakbas91_test_of_prowjob_nightly-main-reconciler-e2e/1463157597619097600

It creates a Gardener cluster, deploys the latest Reconciler, reconciles Kyma, runs the fast-integration tests, breaks Kyma by deleting the deployments from the kyma-system namespace, reconciles Kyma again, re-runs the fast-integration tests, and finally deletes the cluster.
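
The "break Kyma" step could look like this minimal client-go sketch, which deletes every deployment in kyma-system in one call so the next reconciliation has to restore them; the kubeconfig path is a placeholder:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// Delete all deployments in the kyma-system namespace in one call.
	err = clientset.AppsV1().Deployments("kyma-system").DeleteCollection(
		context.Background(), metav1.DeleteOptions{}, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("deleted all deployments in kyma-system")
}
```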

suleymanakbas91 commented 2 years ago

I'm keeping my PR on hold, as the nightly job has stabilized again.

pbochynski commented 2 years ago

After some executions passed, today we have a new error:

sending reconciliation request to mothership-reconciler at: http://reconciler-mothership-reconciler.reconciler/v1/clusters
reconciliationResponse: {"cluster":"6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5","clusterVersion":1,"configurationVersion":1,"failures":null,"status":"reconcile_pending","statusURL":"http://reconciler-mothership-reconciler:80/v1/clusters/6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5/configs/1/status"}
RECONCILE_STATUS_URL: http://reconciler-mothership-reconciler:80/v1/clusters/6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5/configs/1/status
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconcile_pending ....
Waiting for reconciliation to finish, current status: reconciling ....
...
Waiting for reconciliation to finish, current status: reconciling ....
Waiting for reconciliation to finish, current status: reconciling ....
Failed to reconcile Kyma. Exiting
command terminated with exit code 1
2021/11/25 07:11:06 UTC [ERROR] Failed to reconcile Kyma
error: a container name must be specified for pod mothership-reconciler-7fc7d95dfd-4qmdd, choose one of: [mothership-reconciler fluentbit-sidecar]
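
For reference, the status polling the script performs against the statusURL could look like this minimal Go sketch. The "status" field name and the statusURL come from the reconciliationResponse above; the poll interval and the terminal state names are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

func pollStatus(statusURL string) (string, error) {
	for {
		resp, err := http.Get(statusURL)
		if err != nil {
			return "", err
		}
		var body struct {
			Status string `json:"status"`
		}
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil {
			return "", err
		}
		switch body.Status {
		case "ready": // assumed success state
			return body.Status, nil
		case "error": // assumed failure state
			return body.Status, fmt.Errorf("reconciliation failed")
		}
		log.Printf("Waiting for reconciliation to finish, current status: %s", body.Status)
		time.Sleep(30 * time.Second)
	}
}

func main() {
	status, err := pollStatus("http://reconciler-mothership-reconciler:80/v1/clusters/6f8ec1bc-f300-4f00-bb65-6fec5dcd0ba5/configs/1/status")
	if err != nil {
		log.Fatalf("Failed to reconcile Kyma (status %s): %v", status, err)
	}
	log.Printf("Reconciliation finished with status %s", status)
}
```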
tobiscr commented 2 years ago

Latest issue will be fixed in https://github.com/kyma-project/test-infra/pull/4502

suleymanakbas91 commented 2 years ago

The logging problem is fixed. Now the real underlying problem is visible:

to cluster status 'error': Done=cluster-essentials,CRDs,certificates / Error=istio-configuration
[error: All attempts fail:
#1: Could not update Istio: Error occurred when calling istioctl: exit status 1
#2: Could not update Istio: Error occurred when calling istioctl: exit status 1
#3: Could not update Istio: Error occurred when calling istioctl: exit status 1
#4: Could not update Istio: Error occurred when calling istioctl: exit status 1
#5: Could not update Istio: Error occurred when calling istioctl: exit status 1] /
Other=rafter,cluster-users,serverless,tracing,kiali,ory,api-gateway,service-catalog,logging,eventing,service-catalog-addons,monitoring,helm-broker,application-connector"}

This might be related to this change. @dariusztutaj
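
"Error occurred when calling istioctl: exit status 1" implies the reconciler shells out to the istioctl binary but only surfaces the exit code. A minimal sketch of capturing stderr as well, so the actual istioctl failure becomes visible; the subcommand and flag here are illustrative assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
)

func runIstioctl(args ...string) error {
	cmd := exec.Command("istioctl", args...)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Surface istioctl's own error output, not just "exit status 1".
		return fmt.Errorf("Error occurred when calling istioctl: %w: %s", err, stderr.String())
	}
	return nil
}

func main() {
	if err := runIstioctl("version", "--remote=false"); err != nil {
		log.Fatal(err)
	}
	log.Println("istioctl call succeeded")
}
```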