openvswitch / ovs-issues

Issue tracker repo for Open vSwitch
Recover ovn-nb and ovn-sb from un-clustered status. #242

Closed luckydogxf closed 2 years ago

luckydogxf commented 2 years ago

I have three OVN cluster nodes: the first is 172.16.200.147, the second is 172.16.200.133, and the third was broken and has just been rebuilt with IP 172.16.200.63.

But the cluster has no NB or SB leader. Here is the status of each node:

############ 172.16.200.147 ##############

bc77
Name: OVN_Southbound
Cluster ID: a309 (a3090f2b-0246-4daa-a14e-40e369dd955e)
Server ID: bc77 (bc77a898-f5f1-4f4a-b049-f1926274d4f3)
Address: ssl:172.16.200.147:6644
Status: cluster member
Role: candidate
Term: 8272
Leader: unknown
Vote: self

Election timer: 4000
Log: [12918559, 12918559]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->7e27) ->b550 (->affb) (->f8be) <-b550 <-02e8
Servers:
    7e27 (7e27 at ssl:172.16.200.86:6644)
    bc77 (bc77 at ssl:172.16.200.147:6644) (self) (voted for bc77)
    b550 (b550 at ssl:172.16.200.133:6644) (voted for bc77)
    affb (affb at ssl:172.16.200.124:6644)
    f8be (f8be at ssl:172.16.200.85:6644)
ubuntu@juju-9c90cc-3-lxd-8:~$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
7313
Name: OVN_Northbound
Cluster ID: 5112 (5112af22-1ba3-47bd-a4c3-37c2df8dcd09)
Server ID: 7313 (7313bafb-0063-493b-ae7b-db64be063d3f)
Address: ssl:172.16.200.147:6643
Status: cluster member
Role: candidate
Term: 8282
Leader: unknown
Vote: self

Election timer: 4000
Log: [1051353, 1056017]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->0564) (->2a4d) ->00da (->fd3f) <-00da <-1a19
Servers:
    0564 (0564 at ssl:172.16.200.86:6643)
    2a4d (2a4d at ssl:172.16.200.85:6643)
    00da (00da at ssl:172.16.200.133:6643) (voted for 7313)
    fd3f (fd3f at ssl:172.16.200.124:6643)
    7313 (7313 at ssl:172.16.200.147:6643) (self) (voted for 7313)
############ 172.16.200.133 ##############
b550
Name: OVN_Southbound
Cluster ID: a309 (a3090f2b-0246-4daa-a14e-40e369dd955e)
Server ID: b550 (b5505219-c7d1-4e8e-85c0-629ac4697c25)
Address: ssl:172.16.200.133:6644
Status: cluster member
Role: follower
Term: 8293
Leader: unknown
Vote: bc77

Election timer: 4000
Log: [12918559, 12918559]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->7e27) ->bc77 (->affb) (->f8be) <-02e8 <-bc77
Servers:
    7e27 (7e27 at ssl:172.16.200.86:6644)
    bc77 (bc77 at ssl:172.16.200.147:6644)
    affb (affb at ssl:172.16.200.124:6644)
    b550 (b550 at ssl:172.16.200.133:6644) (self)
    f8be (f8be at ssl:172.16.200.85:6644)
ubuntu@juju-9c90cc-9-lxd-7:~$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
00da
Name: OVN_Northbound
Cluster ID: 5112 (5112af22-1ba3-47bd-a4c3-37c2df8dcd09)
Server ID: 00da (00da7d4a-37a8-47ac-8ad2-f73f82ed1577)
Address: ssl:172.16.200.133:6643
Status: cluster member
Role: follower
Term: 8301
Leader: unknown
Vote: 7313

Election timer: 4000
Log: [1051353, 1056017]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->0564) (->2a4d) ->7313 (->fd3f) <-1a19 <-7313
Servers:
    0564 (0564 at ssl:172.16.200.86:6643)
    2a4d (2a4d at ssl:172.16.200.85:6643)
    00da (00da at ssl:172.16.200.133:6643) (self)
    7313 (7313 at ssl:172.16.200.147:6643)
    fd3f (fd3f at ssl:172.16.200.124:6643)
############ 172.16.200.63, the newly built one ##############
ubuntu@juju-9c90cc-16-lxd-10:~$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
1a19
Name: OVN_Northbound
Cluster ID: 5112 (5112af22-1ba3-47bd-a4c3-37c2df8dcd09)
Server ID: 1a19 (1a19858e-784d-413d-987f-dc78071f42ab)
Address: ssl:172.16.200.63:6643
Status: joining cluster
Remotes for joining: ssl:172.16.200.133:6643 ssl:172.16.200.147:6643 ssl:172.16.200.124:6643 ssl:172.16.200.86:6643 ssl:172.16.200.85:6643
Role: follower
Term: 0
Leader: unknown
Vote: unknown

Election timer: 1000
Log: [1, 1]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->0000 (->0000) (->0000) (->0000)
Servers:
ubuntu@juju-9c90cc-16-lxd-10:~$ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
02e8
Name: OVN_Southbound
Cluster ID: a309 (a3090f2b-0246-4daa-a14e-40e369dd955e)
Server ID: 02e8 (02e8a8ff-37b8-41f5-8881-3d6d929cbadf)
Address: ssl:172.16.200.63:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.124:6644 ssl:172.16.200.133:6644 ssl:172.16.200.85:6644 ssl:172.16.200.86:6644 ssl:172.16.200.147:6644
Role: follower
Term: 0
Leader: unknown
Vote: unknown

Election timer: 1000
Log: [1, 1]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->0000 (->0000) (->0000) (->0000)
Servers:
### ps -ef output on the three nodes is almost the same. ####

root     2076926       1  0 08:52 ?        00:01:52 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-sb.log --remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/var/run/ovn/ovnsb_db.pid --unixctl=/var/run/ovn/ovnsb_db.ctl --remote=db:OVN_Southbound,SB_Global,connections --private-key=/etc/ovn/key_host --certificate=/etc/ovn/cert_host --ca-cert=/etc/ovn/ovn-central.crt --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /var/lib/ovn/ovnsb_db.db
root     2096151       1  1 09:38 ?        00:02:54 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --remote=db:OVN_Northbound,NB_Global,connections --private-key=/etc/ovn/key_host --certificate=/etc/ovn/cert_host --ca-cert=/etc/ovn/ovn-central.crt --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /var/lib/ovn/ovnnb_db.db

As we can see, there are two unused nodes (200.86 and 200.85). 147 and 133 do not elect properly, so there is no leader for the ovn-nb and ovn-sb databases. And the newly built node doesn't join the cluster.

I tried to kick out 200.85 and 200.86 from both 200.133 and 200.147, but it just doesn't work:

sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound f8be
failed to send removal request
ovs-appctl: /var/run/ovn/ovnsb_db.ctl: server returned an error

As we can see, the connection is wrong.

Status of 133 (NB/SB):

Status: cluster member
Role: candidate
Term: 17494
Leader: unknown
Vote: self

##

133 voted for itself and became a candidate. Now let's check 147:

Status: cluster member
Role: follower
Term: 17486
Leader: unknown
Vote: b550
Election timer: 4000
Log: [12918559, 12918559]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->7e27) ->b550 (->affb) (->f8be) <-b550 <-02e8
Servers:
    7e27 (7e27 at ssl:172.16.200.86:6644)
    bc77 (bc77 at ssl:172.16.200.147:6644) (self)
    b550 (b550 at ssl:172.16.200.133:6644)
    affb (affb at ssl:172.16.200.124:6644)
    f8be (f8be at ssl:172.16.200.85:6644)

147 also votes for b550, which is 133, and becomes a follower.

So the main problem seems to be 'Connections': 147 and 133 keep trying to connect to 85 and 86 (stale IPs), and somehow that makes the election void.
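
To make the stale members easier to spot, here is a minimal Python sketch that reads a cluster/status dump and lists the servers without a live connection. It assumes that entries in parentheses on the "Connections:" line are connections that are not currently established; that is my reading of the output, not something confirmed in this thread. The helper name `unreachable_servers` is made up for illustration.

```python
import re

# Sample copied from the OVN_Southbound status of 172.16.200.147 above.
STATUS = """\
Connections: (->7e27) ->b550 (->affb) (->f8be) <-b550 <-02e8
Servers:
    7e27 (7e27 at ssl:172.16.200.86:6644)
    bc77 (bc77 at ssl:172.16.200.147:6644) (self) (voted for bc77)
    b550 (b550 at ssl:172.16.200.133:6644) (voted for bc77)
    affb (affb at ssl:172.16.200.124:6644)
    f8be (f8be at ssl:172.16.200.85:6644)
"""

# Matches lines such as "7e27 (7e27 at ssl:172.16.200.86:6644) (self)".
SERVER_RE = re.compile(r"^([0-9a-f]{4}) \([0-9a-f]{4} at (\S+?)\)(.*)$")


def unreachable_servers(status_text):
    """Return addresses of listed servers that have no live connection."""
    connected = set()
    servers = {}  # short server ID -> (address, is_self)
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Connections:"):
            for token in line.split()[1:]:
                # Assumption: "(->xxxx)" means the connection is not established.
                if not token.startswith("("):
                    connected.add(token.lstrip("<->"))
            continue
        match = SERVER_RE.match(line)
        if match:
            sid, address, rest = match.groups()
            servers[sid] = (address, "(self)" in rest)
    return [addr for sid, (addr, is_self) in servers.items()
            if not is_self and sid not in connected]


print(unreachable_servers(STATUS))
```

On the sample this reports the .86, .124 and .85 members, so three of the five listed servers are unreachable from 147.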

luckydogxf commented 2 years ago

sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound f8be
# failed to send removal request
# ovs-appctl: /var/run/ovn/ovnsb_db.ctl: server returned an error

I guess it's because the cluster has no leader, so stale members cannot be kicked out.

igsilya commented 2 years ago

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster?

On the topic, you have a 5-server cluster now with only 2 of them being alive. Therefore, there is no quorum, hence there is no way to elect a leader, hence no way to add or delete servers or perform any other database modifications. You need to bring one of the dead servers back to life (not rebuild it), otherwise the cluster is effectively unrecoverable without manual changes in the database files, which would be very intrusive and hard to do correctly. If you cannot bring up one of the dead servers, the easiest option would be to destroy the current cluster altogether and rebuild it from scratch.
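
The quorum arithmetic behind this can be written out as a small Python sketch. It uses the standard Raft majority rule (more than half of the configured servers); the function names are only illustrative.

```python
def quorum(cluster_size):
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def failure_tolerance(cluster_size):
    """How many servers may fail while a majority is still alive."""
    return (cluster_size - 1) // 2


def can_elect_leader(cluster_size, alive):
    """A leader can be elected only if a majority of servers is alive."""
    return alive >= quorum(cluster_size)


# The situation in this issue: 5 configured servers, only 2 alive.
print(quorum(5), can_elect_leader(5, 2), can_elect_leader(5, 3))
```

With 5 configured servers the majority is 3, so 2 live servers can never elect a leader, while bringing one dead server back would be enough.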

luckydogxf commented 2 years ago

Thanks for your reply. I read the docs (https://docs.openvswitch.org/en/latest/ref/ovsdb.7/#clustered-database-service-model):

To add a server to a cluster, run ovsdb-tool join-cluster on the new server and start ovsdb-server. To remove a running server from a cluster, use ovs-appctl to invoke the cluster/leave command. When a server fails and cannot be recovered, e.g. because its hard disk crashed, or to otherwise remove a server that is down from a cluster, use ovs-appctl to invoke cluster/kick to make the remaining servers kick it out of the cluster.

The above methods for adding and removing servers only work for healthy clusters, that is, for clusters with no more failures than their maximum tolerance. For example, in a 3-server cluster, the failure of 2 servers prevents servers joining or leaving the cluster (as well as database access).


I just realized that I cannot use commands like cluster/kick to force the stale IPs out, because the cluster is unhealthy now.

85 and 86 are stale IPs and cannot come back. Currently the cluster isn't established. So is there any way to modify the OVSDB files manually? If not, I have to rebuild the cluster from scratch.

It's an OpenStack OVN-* component, so I believe I can recover OVSDB-NB from the Neutron DB, and OVSDB-SB could be synced from the compute nodes, right?

luckydogxf commented 2 years ago

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster?
----> I know this is strange. Let me explain what happened.

  1. One of the three OVN nodes failed. We added a new one with juju, found it was unhealthy, and removed it later (its IP may have been 85).
  2. We repeated step 1 twice; this time the IPs may have been 86 and 124.
  3. The three newly added nodes were removed later.

It happens.

luckydogxf commented 2 years ago

It seems impossible to change the OVSDB-SB and OVSDB-NB databases manually. I have to stop all ovn-* services, remove /var/lib/ovn/.db and then start them.

luckydogxf commented 2 years ago

Currently I have three nodes, IP 147/133/69

200.147 OVN-* Config.

# /etc/default/ovn-central
OVN_CTL_OPTS= \
    --db-nb-cluster-local-addr=172.16.200.147 \
    --db-nb-cluster-local-port=6643 \
    --db-nb-cluster-local-proto=ssl \
    --ovn-nb-db-ssl-key=/etc/ovn/key_host \
    --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-nb-cluster-remote-addr=172.16.200.133 \
    --db-nb-cluster-remote-port=6643 \
    --db-nb-cluster-remote-proto=ssl \
    --db-sb-cluster-local-addr=172.16.200.147 \
    --db-sb-cluster-local-port=6644 \
    --db-sb-cluster-local-proto=ssl \
    --ovn-sb-db-ssl-key=/etc/ovn/key_host \
    --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-sb-cluster-remote-addr=172.16.200.133 \
    --db-sb-cluster-remote-port=6644 \
    --db-sb-cluster-remote-proto=ssl
### 200.133 ###
OVN_CTL_OPTS= \
    --db-nb-cluster-local-addr=172.16.200.133 \
    --db-nb-cluster-local-port=6643 \
    --db-nb-cluster-local-proto=ssl \
    --ovn-nb-db-ssl-key=/etc/ovn/key_host \
    --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-nb-cluster-remote-addr=172.16.200.147 \
    --db-nb-cluster-remote-port=6643 \
    --db-nb-cluster-remote-proto=ssl \
    --db-sb-cluster-local-addr=172.16.200.133 \
    --db-sb-cluster-local-port=6644 \
    --db-sb-cluster-local-proto=ssl \
    --ovn-sb-db-ssl-key=/etc/ovn/key_host \
    --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-sb-cluster-remote-addr=172.16.200.147 \
    --db-sb-cluster-remote-port=6644 \
    --db-sb-cluster-remote-proto=ssl
### 200.69 ###
OVN_CTL_OPTS= \
    --db-nb-cluster-local-addr=172.16.200.69 \
    --db-nb-cluster-local-port=6643 \
    --db-nb-cluster-local-proto=ssl \
    --ovn-nb-db-ssl-key=/etc/ovn/key_host \
    --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-nb-cluster-remote-addr=172.16.200.147 \
    --db-nb-cluster-remote-port=6643 \
    --db-nb-cluster-remote-proto=ssl \
    --db-sb-cluster-local-addr=172.16.200.69 \
    --db-sb-cluster-local-port=6644 \
    --db-sb-cluster-local-proto=ssl \
    --ovn-sb-db-ssl-key=/etc/ovn/key_host \
    --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    --db-sb-cluster-remote-addr=172.16.200.147 \
    --db-sb-cluster-remote-port=6644 \
    --db-sb-cluster-remote-proto=ssl

As we can see, 147 tries to connect to 133, 133 tries to connect to 147, and 69 tries to connect to 147.

147 ---> 133
133 ---> 147
69  ---> 147

I stopped all ovn-* services and removed /var/lib/ovn/ovn and /var/lib/ovn/.ovn. Then I restarted all ovn-* services on the three nodes.

But the cluster wasn't established; please see the information below.

### status of 147 ###
sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
f6fe
Name: OVN_Southbound
Cluster ID: not yet known
Server ID: f6fe (f6fe41de-2ac6-4a48-a748-b263dd4509d5)
Address: ssl:172.16.200.147:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.133:6644
Role: follower
Term: 0
Leader: unknown
Vote: unknown

Connections: ->3c7c <-3c7c
### status of 133 ###

Name: OVN_Southbound
Cluster ID: not yet known
Server ID: 3c7c (3c7cedef-2e85-49bf-bb04-5fffab7f2a58)
Address: ssl:172.16.200.133:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.147:6644
### status of 69 ###

Name: OVN_Southbound
Cluster ID: not yet known
Server ID: ae8b (ae8b7cbf-f333-45a3-bb1c-819b75bd48dc)
Address: ssl:172.16.200.69:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.147:6644
Role: follower
Term: 0
Leader: unknown
Vote: unknown
====

So what's the correct way to bring them up again? Thanks.

luckydogxf commented 2 years ago

It seems there must be an initial node in the cluster that does not join anyone else. Otherwise it becomes a chicken-and-egg problem: every node waits to join a cluster that nobody has created.
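
The chicken-and-egg situation can be checked with a tiny Python sketch. It assumes that a node with a --db-*-cluster-remote-addr option always tries to join an existing cluster, and that only a node without one creates a new cluster; this matches what was observed here but is stated as an assumption. The mapping and the helper name `bootstrap_nodes` are hypothetical.

```python
def bootstrap_nodes(remotes):
    """Return the nodes that would create a cluster instead of joining one.

    `remotes` maps each node's local address to its configured remote
    address, or to None when no remote address is configured.
    """
    return sorted(node for node, remote in remotes.items() if remote is None)


# Configuration from the previous comment: every node points at another node.
ALL_JOINING = {
    "172.16.200.147": "172.16.200.133",
    "172.16.200.133": "172.16.200.147",
    "172.16.200.69": "172.16.200.147",
}

# Same layout, but with the remote address dropped on 147.
ONE_INITIAL = dict(ALL_JOINING, **{"172.16.200.147": None})

print(bootstrap_nodes(ALL_JOINING))   # no node would create the cluster
print(bootstrap_nodes(ONE_INITIAL))   # 147 would create it
```

An empty result means all nodes sit in "joining cluster" forever, which is the state shown in the status output above.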

luckydogxf commented 2 years ago
  1. Stopped ovn-* and removed /var/lib/ovn of 147.
  2. Modify /etc/default/ovn-central

    200.147 OVN-* Config.

    # /etc/default/ovn-central
    OVN_CTL_OPTS= \
        --db-nb-cluster-local-addr=172.16.200.147 \
        --db-nb-cluster-local-port=6643 \
        --db-nb-cluster-local-proto=ssl \
        --ovn-nb-db-ssl-key=/etc/ovn/key_host \
        --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
        --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
        ########### REMOVE BELOW LINE ############################
        --db-nb-cluster-remote-addr=172.16.200.133 \
        --db-nb-cluster-remote-port=6643 \
        --db-nb-cluster-remote-proto=ssl \
        --db-sb-cluster-local-addr=172.16.200.147 \
        --db-sb-cluster-local-port=6644 \
        --db-sb-cluster-local-proto=ssl \
        --ovn-sb-db-ssl-key=/etc/ovn/key_host \
        --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
        --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
        ########### REMOVE BELOW LINE ############################
        --db-sb-cluster-remote-addr=172.16.200.133 \
        --db-sb-cluster-remote-port=6644 \
        --db-sb-cluster-remote-proto=ssl

Then I started ovn-* on 147, 133 and 69, and the cluster came back. After that I used ovs-appctl cluster/kick to remove 147, stopped ovn-* on it, reverted /etc/default/ovn-central on 147, and removed /var/lib/ovn/*. Finally I restarted it. Now 147 is completely back to normal.
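
For reference, the two variants of the 147 config used in these steps could be generated with a short Python sketch like the one below. It only emits the cluster address options that appear in this thread (the SSL key/cert options are left out for brevity), and it drops all three remote options for the initial node, whereas above only the remote-addr lines were removed. The function names are made up for illustration.

```python
PORTS = {"nb": 6643, "sb": 6644}  # ports used in this thread


def cluster_opts(local_addr, remote_addr=None, proto="ssl"):
    """Build the --db-*-cluster-* options for one node.

    With remote_addr=None no remote options are emitted, which is the
    "initial node" variant; otherwise the node is configured to join.
    """
    opts = []
    for db, port in PORTS.items():
        opts.append(f"--db-{db}-cluster-local-addr={local_addr}")
        opts.append(f"--db-{db}-cluster-local-port={port}")
        opts.append(f"--db-{db}-cluster-local-proto={proto}")
        if remote_addr is not None:
            opts.append(f"--db-{db}-cluster-remote-addr={remote_addr}")
            opts.append(f"--db-{db}-cluster-remote-port={port}")
            opts.append(f"--db-{db}-cluster-remote-proto={proto}")
    return opts


def render(opts):
    """Format the options the way /etc/default/ovn-central is shown above."""
    return "OVN_CTL_OPTS= \\\n" + " \\\n".join("    " + opt for opt in opts)


# Step 2: 147 as the initial node (no remote).
print(render(cluster_opts("172.16.200.147")))
# After the cluster is back: 147 reverted to joining via 133.
print(render(cluster_opts("172.16.200.147", "172.16.200.133")))
```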

@.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
5bc0
Name: OVN_Southbound
Cluster ID: 9e33 (9e336acc-b7d6-4541-8ec2-16a6dc8133cf)
Server ID: 5bc0 (5bc09c15-df7e-4629-9650-2e188a406864)
Address: ssl:172.16.200.147:6644
Status: cluster member
Role: follower
Term: 2
Leader: acea
Vote: unknown

Election timer: 1000
Log: [2, 9]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->105b <-acea <-105b
Servers:
    acea (acea at ssl:172.16.200.133:6644)
    5bc0 (5bc0 at ssl:172.16.200.147:6644) (self)
    105b (105b at ssl:172.16.200.69:6644)
@.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9707
Name: OVN_Northbound
Cluster ID: 825d (825dda4c-b654-4e8c-8e15-a2648928efb9)
Server ID: 9707 (97070892-50c2-483c-96e2-9998e3788f56)
Address: ssl:172.16.200.147:6643
Status: cluster member
Role: follower
Term: 2
Leader: 112e
Vote: unknown

Election timer: 1000
Log: [2, 9]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->0000 <-112e <-954b
Servers:
    9707 (9707 at ssl:172.16.200.147:6643) (self)
    112e (112e at ssl:172.16.200.69:6643)
    954b (954b at ssl:172.16.200.133:6643)

On Sat, Jan 22, 2022 at 6:46 PM luckydog xf @.***> wrote:

Currently I have three nodes, IP 147/133/69

200.147 OVN-* Config.

/etc/default/ovn-central

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.147 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.133 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.147 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.133 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl

200.133

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.133 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.147 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.133 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.147 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl

200.69

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.69 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.147 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.69 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.147 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl


Are we can see, 147 tries to connect to 133, and 133 tries to connect to 147, while 69 tries to connect to 147.

147 --->133 133 --> 147 69 ---> 147

I stopped all ovn- services, and removed /var/lib/ovn/ovn and /var/lib/ovn/.ovn. Then restarted all ovn- services in three nodes.

But the cluster wasn't established, please see below information.

status of 147

sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound f6fe Name: OVN_Southbound Cluster ID: not yet known Server ID: f6fe (f6fe41de-2ac6-4a48-a748-b263dd4509d5) Address: ssl:172.16.200.147:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.133:6644 Role: follower Term: 0 Leader: unknown Vote: unknown

Connections: ->3c7c <-3c7c

status of 133

Name: OVN_Southbound Cluster ID: not yet known Server ID: 3c7c (3c7cedef-2e85-49bf-bb04-5fffab7f2a58) Address: ssl:172.16.200.133:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.147:6644

satus of 69

Name: OVN_Southbound Cluster ID: not yet known Server ID: ae8b (ae8b7cbf-f333-45a3-bb1c-819b75bd48dc) Address: ssl:172.16.200.69:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.147:6644 Role: follower Term: 0 Leader: unknown Vote: unknown

So what's the correct them to bring them up again ? Thanks.

On Sat, Jan 22, 2022 at 9:56 AM luckydog xf @.***> wrote:

Seems it's impossible to change OVSDB-SB and OVSDB-NB databases manually. I have to stop all ovn- services, remove /var/lib/ovn/.db and then start them.

On Sat, Jan 22, 2022 at 9:47 AM luckydog xf @.***> wrote:

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster? ----> I know this is strange, Let me explain what happened.

  1. One of three ovn nodes failed, we added a new one by juju, and checked it was unhealthy. we removed that node later( IP may be 85).
  2. We repeated #1 step twice , this time the IP may be 86 and 124.
  3. The three-newly-added nodes are removed later. .

On Sat, Jan 22, 2022 at 9:39 AM luckydog xf @.***> wrote:

Thanks for your reply. I read this docs( https://docs.openvswitch.org/en/latest/ref/ovsdb.7/#clustered-database-service-model )

To add a server to a cluster, run ovsdb-tool join-cluster on the new server and start ovsdb-server. To remove a running server from a cluster, use ovs-appctl to invoke the cluster/leave command. When a server fails and cannot be recovered, e.g. because its hard disk crashed, or to otherwise remove a server that is down from a cluster, use ovs-appctl to invoke cluster/kick to make the remaining servers kick it out of the cluster.

The above methods for adding and removing servers only work for healthy clusters, that is, for clusters with no more failures than their maximum tolerance. For example, in a 3-server cluster, the failure of 2 servers prevents servers joining or leaving the cluster (as well as database access).


I just realized that I cannot use commands like cluster/kick to force the stale IPs out, because the cluster is unhealthy now.

85 and 86 are stale IPs and cannot come back. Currently the cluster isn't established. So is there any way to modify the OVSDB files manually? If not, I have to rebuild the cluster from scratch.

It's an OpenStack OVN-* component, so I believe I can recover OVSDB-NB from the Neutron DB, and OVSDB-SB could be synced from the compute nodes, right?

On Fri, Jan 21, 2022 at 9:16 PM Ilya Maximets @.***> wrote:

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster?

On the topic, you have a 5-server cluster now with only 2 of them being alive. Therefore, there is no quorum, hence there is no way to elect a leader, hence no way to add or delete servers or perform any other database modifications. You need to bring one of the dead servers back to life (not rebuild it), otherwise the cluster is effectively unrecoverable without manual changes in the database files, which would be very intrusive and hard to do correctly. If you cannot bring up one of the dead servers, the easiest option would be to destroy the current cluster altogether and rebuild from scratch.
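The quorum arithmetic behind this reply can be checked with a few lines of shell. This is only a sketch: `quorum` and `has_quorum` are hypothetical helper names, not part of any OVS tool, and the only rule assumed is that a leader needs votes from a strict majority of the servers recorded in the cluster.

```shell
#!/bin/sh
# Sketch: majority arithmetic for a clustered OVSDB database.
# quorum and has_quorum are hypothetical helpers, not OVS commands.

# Smallest number of servers that forms a strict majority of $1 servers.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Prints "yes" if $2 reachable servers out of $1 recorded servers
# are enough to elect a leader, otherwise "no".
has_quorum() {
    total=$1
    alive=$2
    if [ "$alive" -ge "$(quorum "$total")" ]; then
        echo yes
    else
        echo no
    fi
}

# The situation in this thread: 5 servers recorded, 2 reachable.
echo "majority needed for 5 servers: $(quorum 5)"
echo "2 of 5 alive, election possible: $(has_quorum 5 2)"
echo "3 of 5 alive, election possible: $(has_quorum 5 3)"
```

With 5 servers recorded, 3 votes are needed, so the 2 surviving nodes can never elect a leader; one more of the original servers coming back would be enough.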


luckydogxf commented 2 years ago

However, the OVSDB SB and NB databases don't work.

root 7509 1 0 11:19 ? 00:00:00 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=ssl:172.16.200.147:6641,ssl:172.16.200.133:6641,ssl:172.16.200.69:6641 --ovnsb-db=ssl:172.16.200.147:16642,ssl:172.16.200.133:16642,ssl:172.16.200.69:16642 -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --no-chdir --log-file=/var/log/ovn/ovn-northd.log --pidfile=/var/run/ovn/ovn-northd.pid --detach

ovn-northd is running on all three nodes, but ports 6641 and 6642 are not up. Both of these commands return nothing:

sudo netstat -lpan | grep 6641
sudo netstat -lpan | grep 6642

sudo ovn-sbctl get-connection # read-write role="" pssl:16642 read-write role="ovn-controller" pssl:6642

It turns out I need to allow connections. However, after running

sudo ovn-sbctl set-connection role="" pssl:16642

if I add the second one, the first one is overwritten.

Weird.

On Sat, Jan 22, 2022 at 7:29 PM luckydog xf @.***> wrote:

  1. Stopped ovn- and removed /var/lib/ovn on 147.
  2. Modified /etc/default/ovn-central on 147, removing the two lines marked below.

    200.147 OVN-* Config.

    /etc/default/ovn-central

    OVN_CTL_OPTS= \
    --db-nb-cluster-local-addr=172.16.200.147 \
    --db-nb-cluster-local-port=6643 \
    --db-nb-cluster-local-proto=ssl \
    --ovn-nb-db-ssl-key=/etc/ovn/key_host \
    --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    ########### REMOVE BELOW LINE ############################
    --db-nb-cluster-remote-addr=172.16.200.133 \
    --db-nb-cluster-remote-port=6643 \
    --db-nb-cluster-remote-proto=ssl \
    --db-sb-cluster-local-addr=172.16.200.147 \
    --db-sb-cluster-local-port=6644 \
    --db-sb-cluster-local-proto=ssl \
    --ovn-sb-db-ssl-key=/etc/ovn/key_host \
    --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
    --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
    ########### REMOVE BELOW LINE ############################
    --db-sb-cluster-remote-addr=172.16.200.133 \
    --db-sb-cluster-remote-port=6644 \
    --db-sb-cluster-remote-proto=ssl

Then I started ovn- on 147, 133 and 69, and the cluster came back. Next I used ovs-appctl cluster/kick to remove 147, stopped ovn- on it, reverted its /etc/default/ovn-central, and removed /var/lib/ovn/*. Finally I restarted it. Now 147 is completely back to normal.

@.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
5bc0
Name: OVN_Southbound
Cluster ID: 9e33 (9e336acc-b7d6-4541-8ec2-16a6dc8133cf)
Server ID: 5bc0 (5bc09c15-df7e-4629-9650-2e188a406864)
Address: ssl:172.16.200.147:6644
Status: cluster member
Role: follower
Term: 2
Leader: acea
Vote: unknown

Election timer: 1000
Log: [2, 9]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->105b <-acea <-105b
Servers:
    acea (acea at ssl:172.16.200.133:6644)
    5bc0 (5bc0 at ssl:172.16.200.147:6644) (self)
    105b (105b at ssl:172.16.200.69:6644)

@.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9707
Name: OVN_Northbound
Cluster ID: 825d (825dda4c-b654-4e8c-8e15-a2648928efb9)
Server ID: 9707 (97070892-50c2-483c-96e2-9998e3788f56)
Address: ssl:172.16.200.147:6643
Status: cluster member
Role: follower
Term: 2
Leader: 112e
Vote: unknown

Election timer: 1000
Log: [2, 9]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 ->0000 <-112e <-954b
Servers:
    9707 (9707 at ssl:172.16.200.147:6643) (self)
    112e (112e at ssl:172.16.200.69:6643)
    954b (954b at ssl:172.16.200.133:6643)
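For anyone repeating the kick step, here is a minimal sketch that only prints the cluster/kick invocation for each database instead of running it. `kick_cmd` is a hypothetical helper; the control socket paths are the ones used with cluster/status in this thread, and the server argument is assumed to be the server ID (or address) that cluster/status reports on a node that is still a cluster member.

```shell
#!/bin/sh
# Sketch: print the cluster/kick command for one OVN database.
# kick_cmd is a hypothetical helper; nothing is executed against ovsdb-server.
kick_cmd() {
    db=$1       # OVN_Northbound or OVN_Southbound
    server=$2   # server ID (or address) as listed by cluster/status
    case "$db" in
        OVN_Northbound) ctl=/var/run/ovn/ovnnb_db.ctl ;;
        OVN_Southbound) ctl=/var/run/ovn/ovnsb_db.ctl ;;
        *) echo "unknown database: $db" >&2; return 1 ;;
    esac
    echo "ovs-appctl -t $ctl cluster/kick $db $server"
}

# SERVER_ID is a placeholder: take the real value from cluster/status.
kick_cmd OVN_Southbound SERVER_ID
kick_cmd OVN_Northbound SERVER_ID
```

Each database has its own cluster and its own server IDs, so the kick has to be issued once for OVN_Southbound and once for OVN_Northbound. Review the printed commands and run them with sudo on a healthy member.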

On Sat, Jan 22, 2022 at 6:46 PM luckydog xf @.***> wrote:

Currently I have three nodes, IP 147/133/69

200.147 OVN-* Config.

/etc/default/ovn-central

OVN_CTL_OPTS= \
--db-nb-cluster-local-addr=172.16.200.147 \
--db-nb-cluster-local-port=6643 \
--db-nb-cluster-local-proto=ssl \
--ovn-nb-db-ssl-key=/etc/ovn/key_host \
--ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-nb-cluster-remote-addr=172.16.200.133 \
--db-nb-cluster-remote-port=6643 \
--db-nb-cluster-remote-proto=ssl \
--db-sb-cluster-local-addr=172.16.200.147 \
--db-sb-cluster-local-port=6644 \
--db-sb-cluster-local-proto=ssl \
--ovn-sb-db-ssl-key=/etc/ovn/key_host \
--ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-sb-cluster-remote-addr=172.16.200.133 \
--db-sb-cluster-remote-port=6644 \
--db-sb-cluster-remote-proto=ssl

200.133

OVN_CTL_OPTS= \
--db-nb-cluster-local-addr=172.16.200.133 \
--db-nb-cluster-local-port=6643 \
--db-nb-cluster-local-proto=ssl \
--ovn-nb-db-ssl-key=/etc/ovn/key_host \
--ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-nb-cluster-remote-addr=172.16.200.147 \
--db-nb-cluster-remote-port=6643 \
--db-nb-cluster-remote-proto=ssl \
--db-sb-cluster-local-addr=172.16.200.133 \
--db-sb-cluster-local-port=6644 \
--db-sb-cluster-local-proto=ssl \
--ovn-sb-db-ssl-key=/etc/ovn/key_host \
--ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-sb-cluster-remote-addr=172.16.200.147 \
--db-sb-cluster-remote-port=6644 \
--db-sb-cluster-remote-proto=ssl

200.69

OVN_CTL_OPTS= \
--db-nb-cluster-local-addr=172.16.200.69 \
--db-nb-cluster-local-port=6643 \
--db-nb-cluster-local-proto=ssl \
--ovn-nb-db-ssl-key=/etc/ovn/key_host \
--ovn-nb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-nb-cluster-remote-addr=172.16.200.147 \
--db-nb-cluster-remote-port=6643 \
--db-nb-cluster-remote-proto=ssl \
--db-sb-cluster-local-addr=172.16.200.69 \
--db-sb-cluster-local-port=6644 \
--db-sb-cluster-local-proto=ssl \
--ovn-sb-db-ssl-key=/etc/ovn/key_host \
--ovn-sb-db-ssl-cert=/etc/ovn/cert_host \
--ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \
--db-sb-cluster-remote-addr=172.16.200.147 \
--db-sb-cluster-remote-port=6644 \
--db-sb-cluster-remote-proto=ssl


As we can see, 147 tries to connect to 133, 133 tries to connect to 147, and 69 tries to connect to 147.

147 ---> 133
133 ---> 147
69 ---> 147
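One way to read this layout: every node has a cluster remote address, so every node tries to join an existing cluster and none of them creates one. The sketch below counts how many nodes in a layout would bootstrap. It assumes, consistent with the fix described elsewhere in this thread, that a node whose remote-addr is removed creates the cluster while the others join it. `count_bootstrap_nodes` and the `local:remote` notation are made up for this illustration.

```shell
#!/bin/sh
# Sketch: count nodes that would create (bootstrap) the cluster.
# Each argument is "local:remote"; an empty remote means "no remote-addr set".
# count_bootstrap_nodes is a hypothetical helper, not an OVN command.
count_bootstrap_nodes() {
    count=0
    for entry in "$@"; do
        remote=${entry#*:}          # text after the first colon
        if [ -z "$remote" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Layout from the three configs above: everyone points at someone else.
echo "original layout: $(count_bootstrap_nodes 147:133 133:147 69:147) bootstrap node(s)"

# Layout after removing the remote-addr lines on 147.
echo "remote removed on 147: $(count_bootstrap_nodes 147: 133:147 69:147) bootstrap node(s)"
```

A freshly wiped set of nodes with zero bootstrap nodes would be expected to stay in "joining cluster" indefinitely, which matches the status output reported for this layout.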

I stopped all ovn- services and removed /var/lib/ovn/ovn and /var/lib/ovn/.ovn. Then I restarted all ovn- services on the three nodes.

But the cluster wasn't established; please see the information below.

Status of 147

sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
f6fe
Name: OVN_Southbound
Cluster ID: not yet known
Server ID: f6fe (f6fe41de-2ac6-4a48-a748-b263dd4509d5)
Address: ssl:172.16.200.147:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.133:6644
Role: follower
Term: 0
Leader: unknown
Vote: unknown

Connections: ->3c7c <-3c7c

Status of 133

Name: OVN_Southbound
Cluster ID: not yet known
Server ID: 3c7c (3c7cedef-2e85-49bf-bb04-5fffab7f2a58)
Address: ssl:172.16.200.133:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.147:6644

Status of 69

Name: OVN_Southbound
Cluster ID: not yet known
Server ID: ae8b (ae8b7cbf-f333-45a3-bb1c-819b75bd48dc)
Address: ssl:172.16.200.69:6644
Status: joining cluster
Remotes for joining: ssl:172.16.200.147:6644
Role: follower
Term: 0
Leader: unknown
Vote: unknown

So what's the correct way to bring them up again? Thanks.


luckydogxf commented 2 years ago

@.***:~$ sudo ovn-sbctl set-connection read-write role="" pssl:16642

@.***:~$ sudo ovn-sbctl set-connection read-write role="ovn-controller" pssl:6642

@.***:~$ sudo ovn-sbctl get-connection
read-write role="ovn-controller" pssl:6642

Weird. The first one is overwritten.


luckydogxf commented 2 years ago

Damn it, I just needed to pass both targets in one command:

sudo ovn-sbctl set-connection read-write role="" pssl:16642 read-write role="ovn-controller" pssl:6642
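This matches how set-connection behaves: each call replaces the whole list of configured targets, so running it once per listener leaves only the last one. As a small sketch, the helper below assembles a single invocation from any number of targets and prints it rather than running it. `build_set_connection` is a hypothetical name, and the two targets are the ones from this thread.

```shell
#!/bin/sh
# Sketch: assemble one ovn-sbctl set-connection call that lists every target.
# build_set_connection is a hypothetical helper; it only prints the command.
build_set_connection() {
    printf 'ovn-sbctl set-connection'
    for target in "$@"; do
        printf ' %s' "$target"
    done
    printf '\n'
}

# Both listeners from this thread, passed together so neither replaces the other.
build_set_connection \
    'read-write role="" pssl:16642' \
    'read-write role="ovn-controller" pssl:6642'
```

After running the printed command with sudo, `ovn-sbctl get-connection` should list both targets.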

On Sat, Jan 22, 2022 at 9:00 PM luckydog xf @.***> wrote:

@.***:~$ sudo ovn-sbctl set-connection read-write role="" pssl:16642

@.***:~$ sudo ovn-sbctl set-connection read-write role="ovn-controller" pssl:6642

@.***:~$ sudo ovn-sbctl get-connection read-write role="ovn-controller" pssl:6642

weird. The first one is overwided.

On Sat, Jan 22, 2022 at 8:35 PM luckydog xf @.***> wrote:

However OVSDB SB and NB don't work.

root 7509 1 0 11:19 ? 00:00:00 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=ssl: 172.16.200.147:6641,ssl:172.16.200.133:6641,ssl:172.16.200.69:6641 --ovnsb-db=ssl:172.16.200.147:16642,ssl:172.16.200.133:16642,ssl: 172.16.200.69:16642 -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --no-chdir --log-file=/var/log/ovn/ovn-northd.log --pidfile=/var/run/ovn/ovn-northd.pid --detach

All are running on three nodes.

But 6641 and 6642 are not up.

Both are empty.

sudo netstat -lpan | grep 6641 sudo netstat -lpan | grep 6642

sudo ovn-sbctl get-connection # read-write role="" pssl:16642 read-write role="ovn-controller" pssl:6642

Just find that I need to allow connections.

However,

sudo ovn-sbctl set-connection role="" pssl:16642 if I add the second one, the first one would be override.

Weird.

On Sat, Jan 22, 2022 at 7:29 PM luckydog xf @.***> wrote:

  1. Stopped ovn- and removed /var/lib/ovn of 147.
  2. modify /etc/default/ovn-central

    200.147 OVN-* Config.

    /etc/default/ovn-central

    OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.147 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ ########### REMOVE BELOW LINE ############################

    --db-nb-cluster-remote-addr=172.16.200.133 \

    --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.147 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ ########### REMOVE BELOW LINE ############################

    --db-sb-cluster-remote-addr=172.16.200.133 \

    --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl

Then started ovn- in 147, 133 and 69, cluster is back now. Use ovc-appctl cluster/kick to remove 147, stop ovn-, revert /etc/default/ovn-central of 147, remove /var/lib/ovn/*. Finally restart. Now 147 is completely back to normal.

@.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 5bc0 Name: OVN_Southbound Cluster ID: 9e33 (9e336acc-b7d6-4541-8ec2-16a6dc8133cf) Server ID: 5bc0 (5bc09c15-df7e-4629-9650-2e188a406864) Address: ssl:172.16.200.147:6644 Status: cluster member Role: follower Term: 2 Leader: acea Vote: unknown

Election timer: 1000 Log: [2, 9] Entries not yet committed: 0 Entries not yet applied: 0 Connections: ->0000 ->105b <-acea <-105b Servers: acea (acea at ssl:172.16.200.133:6644) 5bc0 (5bc0 at ssl:172.16.200.147:6644) (self) 105b (105b at ssl:172.16.200.69:6644) @.***:/var/lib/ovn$ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 9707 Name: OVN_Northbound Cluster ID: 825d (825dda4c-b654-4e8c-8e15-a2648928efb9) Server ID: 9707 (97070892-50c2-483c-96e2-9998e3788f56) Address: ssl:172.16.200.147:6643 Status: cluster member Role: follower Term: 2 Leader: 112e Vote: unknown

Election timer: 1000 Log: [2, 9] Entries not yet committed: 0 Entries not yet applied: 0 Connections: ->0000 ->0000 <-112e <-954b Servers: 9707 (9707 at ssl:172.16.200.147:6643) (self) 112e (112e at ssl:172.16.200.69:6643) 954b (954b at ssl:172.16.200.133:6643)

On Sat, Jan 22, 2022 at 6:46 PM luckydog xf @.***> wrote:

Currently I have three nodes, IP 147/133/69

200.147 OVN-* Config.

/etc/default/ovn-central

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.147 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.133 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.147 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.133 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl

200.133

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.133 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.147 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.133 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.147 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl

200.69

OVN_CTL_OPTS= \ --db-nb-cluster-local-addr=172.16.200.69 \ --db-nb-cluster-local-port=6643 \ --db-nb-cluster-local-proto=ssl \ --ovn-nb-db-ssl-key=/etc/ovn/key_host \ --ovn-nb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-nb-cluster-remote-addr=172.16.200.147 \ --db-nb-cluster-remote-port=6643 \ --db-nb-cluster-remote-proto=ssl \ --db-sb-cluster-local-addr=172.16.200.69 \ --db-sb-cluster-local-port=6644 \ --db-sb-cluster-local-proto=ssl \ --ovn-sb-db-ssl-key=/etc/ovn/key_host \ --ovn-sb-db-ssl-cert=/etc/ovn/cert_host \ --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-central.crt \ --db-sb-cluster-remote-addr=172.16.200.147 \ --db-sb-cluster-remote-port=6644 \ --db-sb-cluster-remote-proto=ssl


Are we can see, 147 tries to connect to 133, and 133 tries to connect to 147, while 69 tries to connect to 147.

147 --->133 133 --> 147 69 ---> 147

I stopped all ovn- services, and removed /var/lib/ovn/ovn and /var/lib/ovn/.ovn. Then restarted all ovn- services in three nodes.

But the cluster wasn't established, please see below information.

status of 147

sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound f6fe Name: OVN_Southbound Cluster ID: not yet known Server ID: f6fe (f6fe41de-2ac6-4a48-a748-b263dd4509d5) Address: ssl:172.16.200.147:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.133:6644 Role: follower Term: 0 Leader: unknown Vote: unknown

Connections: ->3c7c <-3c7c

status of 133

Name: OVN_Southbound Cluster ID: not yet known Server ID: 3c7c (3c7cedef-2e85-49bf-bb04-5fffab7f2a58) Address: ssl:172.16.200.133:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.147:6644

satus of 69

Name: OVN_Southbound Cluster ID: not yet known Server ID: ae8b (ae8b7cbf-f333-45a3-bb1c-819b75bd48dc) Address: ssl:172.16.200.69:6644 Status: joining cluster Remotes for joining: ssl:172.16.200.147:6644 Role: follower Term: 0 Leader: unknown Vote: unknown

So what's the correct them to bring them up again ? Thanks.

On Sat, Jan 22, 2022 at 9:56 AM luckydog xf @.***> wrote:

Seems it's impossible to change OVSDB-SB and OVSDB-NB databases manually. I have to stop all ovn- services, remove /var/lib/ovn/.db and then start them.

On Sat, Jan 22, 2022 at 9:47 AM luckydog xf @.***> wrote:

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster? ----> I know this is strange, Let me explain what happened.

  1. One of three ovn nodes failed, we added a new one by juju, and checked it was unhealthy. we removed that node later( IP may be 85).
  2. We repeated #1 step twice , this time the IP may be 86 and 124.
  3. The three-newly-added nodes are removed later. .

On Sat, Jan 22, 2022 at 9:39 AM luckydog xf @.***> wrote:

Thanks for your reply. I read this docs( https://docs.openvswitch.org/en/latest/ref/ovsdb.7/#clustered-database-service-model )

To add a server to a cluster, run ovsdb-tool join-cluster on the new server and start ovsdb-server. To remove a running server from a cluster, use ovs-appctl to invoke the cluster/leave command. When a server fails and cannot be recovered, e.g. because its hard disk crashed, or to otherwise remove a server that is down from a cluster, use ovs-appctl to invoke cluster/kick to make the remaining servers kick it out of the cluster.

The above methods for adding and removing servers only work for healthy clusters, that is, for clusters with no more failures than their maximum tolerance. For example, in a 3-server cluster, the failure of 2 servers prevents servers joining or leaving the cluster (as well as database access).


I just realized that I cannot use commands like cluster/kick to force the stale IPs out, because the cluster is unhealthy now.

85 and 86 are stale IPs and cannot come back. Currently the cluster isn't established. So is there any way to modify the OVSDB files manually? If not, I have to rebuild the cluster from scratch.

This is the OVN component of an OpenStack deployment, so I believe I can recover OVSDB-NB from the Neutron DB, and OVSDB-SB could be synced from the compute nodes, right?

On Fri, Jan 21, 2022 at 9:16 PM Ilya Maximets < @.***> wrote:

If you had only 3 nodes, how did you end up with 5 servers in the ovsdb cluster?

On the topic, you have a 5-server cluster now with only 2 of them being alive. Therefore, there is no quorum, hence there is no way to elect a leader, hence no way to add or delete servers or perform any other database modifications. You need to bring one of the dead servers back to life (not rebuild it), otherwise the cluster is effectively unrecoverable without manual changes in the database files, which would be very intrusive and hard to do correctly. If you cannot bring up one of the dead servers, the easiest option would be to destroy the current cluster altogether and rebuild from scratch.
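To make the quorum arithmetic concrete, here is a minimal sketch. It assumes the usual Raft rule that a strict majority of the configured servers must be reachable. The function names `quorum` and `has_quorum` are made up for illustration and are not part of any OVS tool.

```shell
#!/bin/sh
# Hypothetical helpers (not OVS commands): Raft majority arithmetic.

# quorum N: smallest number of servers that forms a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum N ALIVE: succeeds if ALIVE servers out of N form a majority.
has_quorum() {
    [ "$2" -ge "$(quorum "$1")" ]
}

# The situation in this thread: 5 configured servers, 2 alive.
if has_quorum 5 2; then
    echo "5 servers, 2 alive: quorum"
else
    echo "5 servers, 2 alive: no quorum (need $(quorum 5))"
fi
```

With 5 configured servers the majority is 3, so 2 live servers can never elect a leader, while a 3-server cluster with 2 live servers still can.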


luckydogxf commented 2 years ago

The next steps were to set the connection for NB (6641) and use neutron-ovn-db-sync-util to sync data from the Neutron database to OVN-NB. OVN-SB was populated soon after restarting ovn-* on the compute nodes. Problem resolved. Thanks.
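For anyone following along, those steps might look roughly like the dry-run sketch below, which only prints commands and does not run them. Several details are assumptions and not taken from this thread: the `pssl:` connection target, the Neutron config file paths, the `repair` sync mode, and the `ovn-controller` service name. `print_resync_steps` is a hypothetical helper name. Adjust everything to your deployment before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps described above instead of running them.
# Assumptions (not from this thread): the pssl: target, the config file
# paths, the "repair" sync mode, and the ovn-controller service name.
print_resync_steps() {
    # 1. On an OVN central node: make the NB database listen on port 6641.
    echo "ovn-nbctl set-connection pssl:6641"
    # 2. On a Neutron server node: repopulate OVN-NB from the Neutron DB.
    echo "neutron-ovn-db-sync-util --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --ovn-neutron_sync_mode repair"
    # 3. On each compute node: restart ovn-controller so OVN-SB gets repopulated.
    echo "systemctl restart ovn-controller"
}

print_resync_steps
```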

igsilya commented 2 years ago

@luckydogxf I'm glad you resolved the issue. And thank you for documenting the steps here, this might be really helpful for someone else looking around for solutions.

I guess the bottom line here is: it's important to kick the dead server from the cluster (with ovs-appctl cluster/kick) before replacing it.
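As a rough sketch of what that looks like: `cluster/kick` takes the database name and the server to remove, given either as a server ID or as its cluster address. The helper below, `kick_cmd`, is a hypothetical name and only prints the command. The control socket path is the one used earlier in this thread, and `7e27` is just a stand-in server ID.

```shell
#!/bin/sh
# Hypothetical helper: builds (and only prints) a cluster/kick invocation.
# kick_cmd CTL_SOCKET DB SERVER
#   SERVER is a server ID or the server's cluster address.
kick_cmd() {
    printf 'ovs-appctl -t %s cluster/kick %s %s\n' "$1" "$2" "$3"
}

# Example values: socket path from this thread, 7e27 as a stand-in server ID.
kick_cmd /var/run/ovn/ovnsb_db.ctl OVN_Southbound 7e27
```

As the docs quoted above point out, this only works while the cluster is still healthy, so the kick has to happen before more failures take away quorum.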