openvswitch / ovs-issues

Issue tracker repo for Open vSwitch
10 stars 3 forks source link

syntax error: Parsing raft header failed: Type mismatch for member 'prev_servers' #160

Closed greenpau closed 5 years ago

greenpau commented 6 years ago

@blp , seeing these logs ~ every 25 seconds:

2018-09-04T13:26:50.078Z|78043|reconnect|INFO|tcp:10.10.10.10:6641: connected
2018-09-04T13:26:50.080Z|78044|ovsdb_idl|INFO|tcp:10.10.10.10:6641: clustered database server is disconnected from cluster; trying another server
2018-09-04T13:27:14.088Z|78045|reconnect|INFO|tcp:10.10.10.10:6641: connected
2018-09-04T13:27:14.089Z|78046|ovsdb_idl|INFO|tcp:10.10.10.10:6641: clustered database server is disconnected from cluster; trying another server
2018-09-04T13:27:38.100Z|78047|reconnect|INFO|tcp:10.10.10.10:6641: connected
2018-09-04T13:27:38.101Z|78048|ovsdb_idl|INFO|tcp:10.10.10.10:6641: clustered database server is disconnected from cluster; trying another server
2018-09-04T13:28:02.108Z|78049|reconnect|INFO|tcp:10.10.10.10:6641: connected
2018-09-04T13:28:02.110Z|78050|ovsdb_idl|INFO|tcp:10.10.10.10:6641: clustered database server is disconnected from cluster; trying another server
2018-09-04T13:28:26.118Z|78051|reconnect|INFO|tcp:10.10.10.10:6641: connected
2018-09-04T13:28:26.120Z|78052|ovsdb_idl|INFO|tcp:10.10.10.10:6641: clustered database server is disconnected from cluster; trying another server

What is the process of troubleshooting this?

Last entry in /var/log/openvswitch/ovsdb-server-nb.log and /var/log/openvswitch/ovsdb-server-sb.log:

# tail /var/log/openvswitch/ovsdb-server-nb.log
2018-08-21T17:26:55.187Z|00101|jsonrpc|WARN|unix#23: receive error: Connection reset by peer
2018-08-21T17:26:55.187Z|00102|reconnect|WARN|unix#23: connection dropped (Connection reset by peer)
2018-08-21T18:35:22.013Z|00103|jsonrpc|WARN|unix#24: receive error: Connection reset by peer
2018-08-21T18:35:22.013Z|00104|reconnect|WARN|unix#24: connection dropped (Connection reset by peer)
2018-08-23T17:15:43.338Z|00105|raft|INFO|current entry eid a424e064-65d9-48d9-a47b-248fe7169099 does not match prerequisite 2f06867b-1269-412e-8e88-31dcf124b1e0
2018-08-23T17:15:43.338Z|00106|raft|INFO|current entry eid a424e064-65d9-48d9-a47b-248fe7169099 does not match prerequisite 2f06867b-1269-412e-8e88-31dcf124b1e0
2018-08-23T17:15:43.349Z|00107|raft|INFO|current entry eid ea1905a2-5182-4fa1-b57a-da3bd7991e09 does not match prerequisite a424e064-65d9-48d9-a47b-248fe7169099
2018-08-23T17:15:43.349Z|00108|raft|INFO|current entry eid ea1905a2-5182-4fa1-b57a-da3bd7991e09 does not match prerequisite a424e064-65d9-48d9-a47b-248fe7169099
2018-08-23T17:15:43.350Z|00109|raft|INFO|current entry eid ea1905a2-5182-4fa1-b57a-da3bd7991e09 does not match prerequisite a424e064-65d9-48d9-a47b-248fe7169099
2018-08-24T17:15:43.495Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid 4517 died, killed (Bus error), restarting

# tail /var/log/openvswitch/ovsdb-server-sb.log
2018-08-21T17:26:45.693Z|00017|raft|INFO|current entry eid 26fdcaa7-3b0d-4253-b696-a835ce3c91be does not match prerequisite ab1310f3-3a53-4394-8ce6-edf84a766d7d in execute_command_request
2018-08-23T17:15:43.458Z|00018|raft|INFO|current entry eid ae6cc324-c58d-43fd-a5be-76799daf629b does not match prerequisite 95a98afb-3a66-4a12-ad07-b047d6f73dff
2018-08-23T17:15:43.459Z|00019|raft|INFO|current entry eid ae6cc324-c58d-43fd-a5be-76799daf629b does not match prerequisite 95a98afb-3a66-4a12-ad07-b047d6f73dff
2018-08-23T17:15:43.460Z|00020|raft|INFO|current entry eid ae6cc324-c58d-43fd-a5be-76799daf629b does not match prerequisite 95a98afb-3a66-4a12-ad07-b047d6f73dff
2018-08-23T17:15:43.461Z|00021|raft|INFO|current entry eid ae6cc324-c58d-43fd-a5be-76799daf629b does not match prerequisite 95a98afb-3a66-4a12-ad07-b047d6f73dff
2018-08-23T17:15:43.480Z|00022|raft|INFO|current entry eid 19b0c769-7fe5-4c35-b1c0-34dfcf1a4b1f does not match prerequisite b66859a2-900b-4346-9ec3-02ffe879ea77 in execute_command_request
2018-08-23T17:15:43.533Z|00023|raft|INFO|current entry eid 0b3a7135-938b-49cb-8050-3ee81ba08467 does not match prerequisite 19b0c769-7fe5-4c35-b1c0-34dfcf1a4b1f in execute_command_request
2018-08-23T17:15:43.539Z|00024|raft|INFO|current entry eid f20ab422-40ce-4e6d-8593-b2fe8eb0b54b does not match prerequisite 0b3a7135-938b-49cb-8050-3ee81ba08467 in execute_command_request
2018-08-23T17:15:43.560Z|00025|raft|INFO|current entry eid 031ea8c1-5ef2-40e6-9f51-d6ae19f33579 does not match prerequisite 265f466c-4930-4a2a-ae1f-dd1c414c2b6f in execute_command_request
2018-08-24T17:15:52.731Z|00026|reconnect|ERR|tcp:127.0.0.1:46692: no response to inactivity probe after 5 seconds, disconnecting
greenpau commented 6 years ago

@blp , after a restart, the DB on a non-master did not come up and I see "_syntax error: Parsing raft header failed: Type mismatch for member 'prevservers'":

# systemctl status ovn-northd
● ovn-northd.service - OVN northd management daemon
   Loaded: loaded (/usr/lib/systemd/system/ovn-northd.service; enabled; vendor preset: disabled)
   Active: active (exited) since Tue 2018-09-04 10:21:41 EDT; 2min 16s ago
  Process: 27954 ExecStop=/usr/share/openvswitch/scripts/ovn-ctl stop_northd (code=exited, status=0/SUCCESS)
  Process: 28005 ExecStart=/usr/share/openvswitch/scripts/ovn-ctl start_northd $OVN_NORTHD_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 28005 (code=exited, status=0/SUCCESS)

Sep 04 10:20:41 ovn01 ovn-ctl[28005]: ovn-nbctl: unix:/run/openvswitch/ovnnb_db.sock: database connection failed (Connection refused)
Sep 04 10:21:11 ovn01 ovn-ctl[28005]: Waiting for OVN_Northbound to come up 2018-09-04T14:21:11Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Sep 04 10:21:11 ovn01 ovn-ctl[28005]: /usr/share/openvswitch/scripts/ovs-lib: line 600: 28053 Alarm clock             "$@"
Sep 04 10:21:11 ovn01 ovn-ctl[28005]: [FAILED]
Sep 04 10:21:11 ovn01 ovsdb-server[28079]: ovs|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server-sb.log
Sep 04 10:21:11 ovn01 ovn-ctl[28005]: ovsdb-server: syntax "{"cluster_id":"0f4c5869-9bad-4b42-a603-8643832bb33d","local_address":"tcp:0.0.0.0:6644","name...ited","min ... 266"}": syntax error: Parsing raft header failed: Type mismatch for member 'prev_servers'.
Sep 04 10:21:11 ovn01 ovn-nbctl[28082]: ovs|00001|nbctl|INFO|Called as ovn-nbctl init
Sep 04 10:21:11 ovn01 ovn-nbctl[28082]: ovs|00002|db_ctl_base|ERR|unix:/run/openvswitch/ovnnb_db.sock: database connection failed (Connection refused)
Sep 04 10:21:11 ovn01 ovn-ctl[28005]: ovn-nbctl: unix:/run/openvswitch/ovnnb_db.sock: database connection failed (Connection refused)
Sep 04 10:21:41 ovn01 systemd[1]: Started OVN northd management daemon.
Hint: Some lines were ellipsized, use -l to show in full.
greenpau commented 6 years ago

@blp , raft related entry on non-leader is:

{
    "cluster_id": "0f4c5869-9bad-4b42-a603-8643832bb33d",
    "local_address": "tcp:0.0.0.0:6644",
    "name": "OVN_Southbound",
    "prev_eid": "7c510ec5-84be-4364-a1c1-929c12952537",
    "prev_index": 93,
    "prev_servers": null,
    "prev_term": 1,
    "server_id": "0b0fa006-de83-4768-98c2-2d53d9ae8266"
}

I think there is an expectation for prev_servers to NOT be null.

blp commented 5 years ago

I've spent some time trying to chase this down and I just can't come across any circumstances where "prev_servers" should end up as null. I see cases where it could be omitted, I see cases where it could be a {}-surrounded object, I can't understand null.

Do you have a reproduction case?

greenpau commented 5 years ago

@blp , not at the moment. If I see this again, I will post it here and reopen.