oracle / oracle-database-operator

The Oracle Database Operator for Kubernetes (a.k.a. OraOperator) helps developers, DBAs, DevOps and GitOps teams reduce the time and complexity of deploying and managing Oracle Databases. It eliminates the dependency on a human operator or administrator for the majority of database operations.
Universal Permissive License v1.0

DG Broker doesn't detect that primary database is down #110

Open andbos opened 5 months ago

andbos commented 5 months ago

Hi,

It seems DG Broker is not able to detect that the primary database is down; its status stays Healthy the whole time.

Standby detected that the primary is down:

 rfs (PID:2778): Possible network disconnect with primary database [krsv.c:4855]
 rfs (PID:2778): while processing B-1171608090.T-1.S-21 [krsv.c:4861]
2024-06-14T10:18:55.090551+00:00

***********************************************************************

Fatal NI connect error 12541, connecting to:
 (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=db11)(INSTANCE_NAME=DB11)(CID=(PROGRAM=oracle)(HOST=sinchdb12-dwake)(USER=oracle))(CONNECTION_ID=Gtegq8X5CdTgYxoCAQoNsQ==))(ADDRESS=(PROTOCOL=tcp)(HOST=172.20.114.142)(PORT=1521)))

  VERSION INFORMATION:
        TNS for Linux: Version 21.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 21.0.0.0.0 - Production
  Version 21.3.0.0.0
  Time: 14-JUN-2024 10:18:55
  Tracing not turned on. Process Id = 2516
  Tns error struct:
    ns main err code: 12541

TNS-12541: TNS:no listener
    ns secondary err code: 12560
    nt main err code: 511

TNS-00511: No listener
    nt secondary err code: 111
    nt OS err code: 0

***********************************************************************

DG Broker, however, did not, even though the status of the primary is Pending:

$ date
Fri Jun 14 12:41:20 CEST 2024

$ kubectl -n oracle-database get pods
NAME               READY   STATUS     RESTARTS   AGE
db11-nfwnl         0/1     Init:1/2   0          23m
db12-dwake         1/1     Running    0          54m

$ kubectl -n oracle-database get singleinstancedatabase
NAME         EDITION      STATUS    ROLE               VERSION      CONNECT STR                 TCPS CONNECT STR   OEM EXPRESS URL
db11         Enterprise   Pending   PRIMARY            21.3.0.0.0   10.1.1.161:32480/DB11       Unavailable        https://10.1.1.161:30473/em
db12         Enterprise   Healthy   PHYSICAL_STANDBY   21.3.0.0.0   10.1.2.200:30739/DB12       Unavailable        https://10.1.2.200:30875/em

$ kubectl -n oracle-database get dataguardbroker
NAME                 PRIMARY   STANDBYS   PROTECTION MODE   CONNECT STR                  STATUS
dataguardbroker-db   DB11      DB12       MaxAvailability   10.1.1.161:31036/DATAGUARD   Healthy

$ kubectl -n oracle-database describe dataguardbroker
Name:         dataguardbroker-db
Namespace:    oracle-database
Labels:       <none>
Annotations:  <none>
API Version:  database.oracle.com/v1alpha1
Kind:         DataguardBroker
Metadata:
  Creation Timestamp:  2024-06-14T09:55:59Z
  Finalizers:
    database.oracle.com/dataguardbrokerfinalizer
  Generation:        1
  Resource Version:  94229431
  UID:               d5703585-503c-46f6-be34-9a3bdb3a40df
Spec:
  Fast Start Fail Over:
  Primary Database Ref:     db11
  Protection Mode:          MaxAvailability
  Set As Primary Database:  DB11
  Standby Database Refs:
    db12
Status:
  Cluster Connect String:   dataguardbroker-db.oracle-database:1521/DATAGUARD
  External Connect String:  10.1.1.161:31036/DATAGUARD
  Primary Database:         DB11
  Primary Database Ref:     db11
  Protection Mode:          MaxAvailability
  Standby Databases:        DB12
  Status:                   Healthy
Events:
  Type    Reason                       Age   From             Message
  ----    ------                       ----  ----             -------
  Normal  DG Configuration up to date  45m   DataguardBroker

Setup: one primary singleinstancedatabase and one standby singleinstancedatabase, both using image enterprise:21.3.0.0. OraOperator version: 1.1.0.

Best regards, Andreas

andbos commented 5 months ago

Is it somehow possible to fail over to the standby? In this case the primary instance is beyond rescue, and I can't switch over because the primary is unreachable.

If I had a clone, I suppose I could have pointed the application to the SID of the clone and then created a new standby and configured DG Broker again (in a disaster recovery situation, I mean). Another option could be to have multiple replicas of the primary.

Best regards, Andreas

IshaanDesai45 commented 5 months ago

@andbos Right now we only support manual switchover to the standby when all the databases are healthy, since the DataguardBroker controller is still in preview release.

To implement failover we need a database observer, which is a roadmap item and will be implemented in the next release, v1.2.0.

andbos commented 5 months ago

Hi,

Thanks for the update, appreciated. Roughly when could we expect v1.2.0?

Best regards, Andreas

IshaanDesai45 commented 5 months ago

We are still discussing the timeline for 1.2.0.

P.S. If you want to switch over when the primary is down with the current implementation of the DataguardController, you can exec into the standby database and manually run the DGMGRL command for the switchover:

  1. Log into the DGMGRL shell: dgmgrl sys/<pwd>
  2. Run SWITCHOVER TO <standby_database_sid>
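
The two steps above could be scripted roughly as follows. This is only a sketch: the namespace `oracle-database` and the pod name `db12-dwake` are taken from the output earlier in this thread, while `run_dgmgrl`, `SYS_PWD` and `DRY_RUN` are hypothetical names introduced for illustration. It relies on `dgmgrl` accepting a command as its last argument, and on `-silent` suppressing the banner.

```shell
# Sketch: run a single DGMGRL command inside the standby pod.
# NAMESPACE and POD default to the values shown in this thread (assumptions).
NAMESPACE="${NAMESPACE:-oracle-database}"
POD="${POD:-db12-dwake}"

run_dgmgrl() {
  # $1 = DGMGRL command, e.g. "SWITCHOVER TO db12"
  # SYS_PWD (hypothetical variable) must hold the SYS password.
  : "${SYS_PWD:?set SYS_PWD first}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print what would be executed; the password is masked.
    printf '%s\n' "kubectl -n $NAMESPACE exec -i $POD -- dgmgrl -silent sys/*** \"$1\""
  else
    kubectl -n "$NAMESPACE" exec -i "$POD" -- dgmgrl -silent "sys/$SYS_PWD" "$1"
  fi
}

# Example (not executed here):
#   SYS_PWD=... run_dgmgrl "SHOW CONFIGURATION"
#   SYS_PWD=... run_dgmgrl "SWITCHOVER TO db12"
```

Note that a password passed on the command line is visible in the process list inside the container, so this is better suited to testing than to production use.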

andbos commented 5 months ago

Hi,

Thanks for the tip. To start with, I tried executing SWITCHOVER TO when the primary was up - it worked:

DGMGRL for Linux: Release 21.0.0.0.0 - Production on Mon Jun 24 09:59:33 2024
Version 21.13.0.0.0

Copyright (c) 1982, 2021, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "DB11"
Connected as SYSDBA.
DGMGRL> show configuration

Configuration - dg_config

  Protection Mode: MaxAvailability
  Members:
  db11 - Primary database
    db12 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 27 seconds ago)

DGMGRL> switchover to db12
2024-06-24T09:59:45.339+00:00
Performing switchover NOW, please wait...

2024-06-24T09:59:45.490+00:00
Operation requires a connection to database "db12"
Connecting ...
Connected to "DB12"
Connected as SYSDBA.

2024-06-24T09:59:45.534+00:00
Continuing with the switchover...

2024-06-24T09:59:52.273+00:00
New primary database "db12" is opening...

2024-06-24T09:59:52.273+00:00
Operation requires start up of instance "DB11" on database "db11"
Starting instance "DB11"...
Connected to an idle instance.
ORACLE instance started.
Connected to "DB11"
Database mounted.
Database opened.

Connected to "DB11"
2024-06-24T10:00:10.370+00:00
Switchover succeeded, new primary is "db12"

2024-06-24T10:00:10.373+00:00
Switchover processing complete, broker ready.
DGMGRL> show configuration

Configuration - dg_config

  Protection Mode: MaxAvailability
  Members:
  db12 - Primary database
    db11 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 18 seconds ago)

However, the DataGuardBroker didn't notice that a switchover was done.

$ date
Mon Jun 24 12:02:09 CEST 2024
$ kubectl --kubeconfig ~/.kube/config-sinch-op-smsf-1-andbos -n oracle-database get dataguardbroker
NAME            PRIMARY   STANDBYS   PROTECTION MODE   CONNECT STR                  STATUS
sidb-dgbroker   DB11      DB12       MaxAvailability   10.1.1.161:32495/DATAGUARD   Healthy

Best regards, Andreas

andbos commented 5 months ago

But when I restarted the new standby (db11), DataGuardBroker reported status Healthy the whole time, despite active ORA errors:

DGMGRL> show configuration

Configuration - dg_config

  Protection Mode: MaxAvailability
  Members:
  db12 - Primary database
    Error: ORA-16810: multiple errors or warnings detected for the member

    db11 - Physical standby database
      Error: ORA-16599: Oracle Data Guard broker detected a stale configuration

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 51 seconds ago)

Not even any events:

$ kubectl -n oracle-database describe dataguardbroker
Name:         sidb-dgbroker
Namespace:    oracle-database
Labels:       <none>
Annotations:  <none>
API Version:  database.oracle.com/v1alpha1
Kind:         DataguardBroker
Metadata:
  Creation Timestamp:  2024-06-24T09:47:05Z
  Finalizers:
    database.oracle.com/dataguardbrokerfinalizer
  Generation:        1
  Resource Version:  133480469
  UID:               de3110c0-78e2-401a-9f91-c245b8519273
Spec:
  Fast Start Fail Over:
  Primary Database Ref:     sidb11
  Protection Mode:          MaxAvailability
  Set As Primary Database:  DB11
  Standby Database Refs:
    sidb12
Status:
  Cluster Connect String:   sidb-dgbroker.oracle-database:1521/DATAGUARD
  External Connect String:  10.1.1.161:32495/DATAGUARD
  Primary Database:         DB11
  Primary Database Ref:     sidb11
  Protection Mode:          MaxAvailability
  Standby Databases:        DB12
  Status:                   Healthy
Events:
  Type    Reason                       Age   From             Message
  ----    ------                       ----  ----             -------
  Normal  DG Configuration up to date  20m   DataguardBroker

IshaanDesai45 commented 5 months ago

@andbos That is true: the DGBroker controller does not detect the switchover when it is done manually, because its reconcile has not been triggered.

To detect manual configuration changes as well, we would depend on the database observer, which is planned for the next release.
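
Until the observer is available, this kind of drift could be spotted from outside the operator by parsing the output of `show configuration`. A minimal sketch, assuming the output format shown in this thread; `dg_primary` and `dg_status` are hypothetical helper names, not part of OraOperator.

```shell
# Both helpers read DGMGRL `show configuration` output on stdin.

# Print the member listed as "Primary database".
dg_primary() {
  awk '/ - Primary database/ { print $1; exit }'
}

# Print the first word of the first non-empty line after
# "Configuration Status:" (e.g. SUCCESS or ERROR).
dg_status() {
  awk 'seen && NF { print $1; exit } /^Configuration Status:/ { seen = 1 }'
}

# Example (not executed here):
#   dgmgrl -silent sys/<pwd> "show configuration" | dg_primary
```

The values printed can then be compared by hand, or in a small polling script, with the PRIMARY and STATUS columns of `kubectl get dataguardbroker`.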

andbos commented 5 months ago

OK, I see. Worse, if I take down the primary and then attempt a switchover or failover to the standby, the operation does not succeed:

DGMGRL for Linux: Release 21.0.0.0.0 - Production on Mon Jun 24 10:11:09 2024
Version 21.13.0.0.0

Copyright (c) 1982, 2021, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "DB12"
Connected as SYSDBA.
DGMGRL> show configuration

Configuration - dg_config

  Protection Mode: MaxAvailability
  Members:
  db11 - Primary database
    db12 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 60 seconds ago)

DGMGRL> show configuration

Configuration - dg_config

  Protection Mode: MaxAvailability
  Members:
  db11 - Primary database
    Error: ORA-12541: TNS:no listener

    db12 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 0 seconds ago)

DGMGRL> failover to db12
ORA-16600: not connected to target standby database

DGMGRL> switchover to db12
2024-06-24T10:14:57.686+00:00
Performing switchover NOW, please wait...

Error: ORA-12541: TNS:no listener
Error: ORA-16625: cannot reach member "db11"

Failed.
2024-06-24T10:14:59.729+00:00
Unable to switchover, primary database is still "db11"

DGMGRL>