Harbor is unhealthy due to unready postgresql, which is ready instead

karaguo commented 2 years ago

Expected behavior and actual behavior: We see a flaky issue during Harbor deployment. The HarborCluster becomes unhealthy even though the postgresql and redis pods are ready; the Harbor components' pods are never started. The only abnormal condition in the HarborCluster status is:

Last Transition Time:  2022-04-26T21:29:13Z
message: psql is Creating
reason: Database is not ready

However, the postgresql cluster's status is RUNNING, and the harbor operator pod logged the following:

2022-04-26T21:29:13.217Z    INFO    harbor-operator.controller.database    Creating Database.    {"controller": "harborcluster", "version": "v1.1.1-***", "git.commit": "none", "namespace": "***", "name": "<postgresql cluster name>"}
2022-04-26T21:29:13.246Z    INFO    harbor-operator.controller.database    Database create complete.    {"controller": "harborcluster", "version": "v1.1.1-***", "git.commit": "none", "namespace": "***", "name": "<postgresql cluster name>"}

So there appears to be nothing wrong with the postgresql DB itself.

Restarting the harbor operator pod fixed the issue: after the restart, the Harbor components were deployed immediately and the HarborCluster became healthy. It therefore seems possible that the harbor operator fails to refresh the database status and keeps reporting the cluster as unhealthy, which would be a false positive.

Hi team, can you please give more insight into this issue? Our team also found a recent change to the pg status check, https://github.com/goharbor/harbor-operator/pull/476/files. Can you please take a look and see whether this is a regression? Thanks!

Steps to reproduce the problem:

changed storage class to support ReadWriteMany
changed harbor pvc access mode from ReadWriteOnce to ReadWriteMany

We have not found a connection between the storage class/PVC change and this Harbor health issue. However, the flaky issue first appeared after the change was merged and did not occur after it was reverted.

Versions: Please specify the versions of following systems.

harbor operator version: [v1.1.1]
harbor version: [v2.4.1]
kubernetes version: [v1.22.8]
Any additional relevant versions such as CertManager
postgresql operator version: 1.5.0
redis operator version: 1.0.0

Additional context:

Harbor dependent services:
- Context info of postgreSQL - version 14.2
- Context info of Redis
- Context info of storage
Log files: Collect logs and attach them here if have.
Kubernetes: How Kubernetes access was provided (what cloud provider, service-account configuration, ...).

karaguo commented 2 years ago

A follow up (status update):

The issue occurred again even without the changes listed under "Steps to reproduce the problem". So it happens with low probability, but it still makes the system unstable.

My guess is that the postgresql database readiness check at https://github.com/goharbor/harbor-operator/blob/544d6737c197a5fefb9c729ae8785614f77ab005/pkg/cluster/controllers/database/readiness.go#L54 decides that the database is not ready because the status is still Creating at the moment it is checked. Adding a retry, or waiting for the status to change from Creating to Running, might fix this.

Can someone help confirm?

karaguo commented 2 years ago

Update:

Returning an error at the following place should fix the issue: https://github.com/goharbor/harbor-operator/blob/544d6737c197a5fefb9c729ae8785614f77ab005/pkg/cluster/controllers/database/readiness.go#L58

goharbor / harbor-operator

Harbor is unhealthy due to unready postgresql, which is ready instead #885