openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

ovsdb-server and ovsdb client data inconsistency causes `"where" clause test failed` #298

Closed: gujun4990 closed this issue 10 months ago

gujun4990 commented 11 months ago

Environment

ovsdb-server runs as a 3-node (3-replica) cluster.

Description

We run the neutron-ovn-metadata-agent service, which uses the OVS Python client (python-idl) to communicate with ovsdb-server; the client connects with leader_only=True. When the ovsdb-server leader changes, the client reconnects to the new leader. After a successful reconnection, we occasionally see that a transact request sent by the client fails: the server returns `"where" clause test failed`. We analyzed the failure and found that the ovsdb-server and ovsdb client data are inconsistent.
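
For context, the failing request is a verify + update of Chassis_Private.external_ids, which python-idl turns into the "wait" + "update" pair shown below. Here is a minimal sketch of that pattern using python-ovs directly (an illustration only, not the actual neutron-ovn-metadata-agent/ovsdbapp code; the schema path and remote addresses are placeholders, and the leader_only argument assumes a python-ovs version with clustered-database support):

    # Illustrative sketch only: connect to the clustered OVN_Southbound DB with
    # leader_only=True and bump neutron:ovn-metadata-sb-cfg on a Chassis_Private
    # row, verifying external_ids first so the server re-checks that column.
    import ovs.db.idl
    import ovs.poller

    SCHEMA = '/usr/share/ovn/ovn-sb.ovsschema'   # placeholder path
    REMOTE = 'tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642'  # placeholders

    helper = ovs.db.idl.SchemaHelper(location=SCHEMA)
    helper.register_table('Chassis_Private')
    idl = ovs.db.idl.Idl(REMOTE, helper, leader_only=True)

    # Run the IDL until the initial monitor reply has populated the local cache.
    seqno = idl.change_seqno
    while idl.change_seqno == seqno and not idl.run():
        poller = ovs.poller.Poller()
        idl.wait(poller)
        poller.block()

    for row in list(idl.tables['Chassis_Private'].rows.values()):
        txn = ovs.db.idl.Transaction(idl)
        row.verify('external_ids')               # becomes the "wait" operation
        ids = dict(row.external_ids)
        ids['neutron:ovn-metadata-sb-cfg'] = '660332'
        row.external_ids = ids                   # becomes the "update" operation
        print(txn.commit_block())                # fails if the cached map is stale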

transact method's normal data

{"id":51099,"method":"transact","params":["OVN_Southbound",{"columns":["external_ids"],"op":"wait","rows":[{"external_ids":["map",[["neutron:ovn-metadata-id","22cc4db2-b4e9-5f08-90d2-2e41bde79024"],["neutron:ovn-metadata-sb-cfg","660331"]]]}],"table":"Chassis_Private","timeout":0,"until":"==","where":[["_uuid","==",["uuid","7baaa5c0-3080-4c55-98f5-783b19ba4d93"]]]},{"op":"update","row":{"external_ids":["map",[["neutron:ovn-metadata-id","22cc4db2-b4e9-5f08-90d2-2e41bde79024"],["neutron:ovn-metadata-sb-cfg","660332"]]]},"table":"Chassis_Private","where":[["_uuid","==",["uuid","7baaa5c0-3080-4c55-98f5-783b19ba4d93"]]]}]}

transact method's abnormal data

{"id":51099,"method":"transact","params":["OVN_Southbound",{"columns":["external_ids"],"op":"wait","rows":[{"external_ids":["map",[["neutron:ovn-metadata-id","22cc4db2-b4e9-5f08-90d2-2e41bde79024"]]]}],"table":"Chassis_Private","timeout":0,"until":"==","where":[["_uuid","==",["uuid","7baaa5c0-3080-4c55-98f5-783b19ba4d93"]]]},{"op":"update","row":{"external_ids":["map",[["neutron:ovn-metadata-id","22cc4db2-b4e9-5f08-90d2-2e41bde79024"],["neutron:ovn-metadata-sb-cfg","660332"]]]},"table":"Chassis_Private","where":[["_uuid","==",["uuid","7baaa5c0-3080-4c55-98f5-783b19ba4d93"]]]}]}

Comparing the normal and abnormal transact data above, the abnormal request is missing ["neutron:ovn-metadata-sb-cfg","660331"] from the "wait" row. But that entry does exist on ovsdb-server, so ovsdb-server returns `"where" clause test failed`. Could there be a situation where the ovsdb client receives the same update3 notification twice when the leader changes? If so, after the ovsdb client runs its diff function, it would delete the ["neutron:ovn-metadata-sb-cfg","660331"] entry from the client's in-memory copy, as sketched below.
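
To illustrate this hypothesis: with monitor_cond_since, map columns in update3 notifications are delivered as diffs, and applying a diff removes a key that is already present with the diffed value. A minimal, self-contained sketch of that diff rule (our reading of it, not the actual python-idl code) showing how applying the same notification twice would drop the entry from the client's copy while the server still has it:

    # Hypothetical illustration of the update3 map-diff rule: for each (key, value)
    # in the diff, a key already present with that exact value is removed,
    # otherwise the key is set. Applying the same diff twice therefore toggles
    # the entry back out of the client's cached row.
    def apply_map_diff(current, diff):
        result = dict(current)
        for key, value in diff.items():
            if result.get(key) == value:
                del result[key]          # same value already present -> remove
            else:
                result[key] = value      # absent or different value -> set
        return result

    cached = {'neutron:ovn-metadata-id': '22cc4db2-b4e9-5f08-90d2-2e41bde79024'}
    diff = {'neutron:ovn-metadata-sb-cfg': '660331'}

    once = apply_map_diff(cached, diff)   # entry added: client matches the server
    twice = apply_map_diff(once, diff)    # duplicated update: entry removed again

    print(once)   # contains 'neutron:ovn-metadata-sb-cfg': '660331'
    print(twice)  # entry gone: matches the "abnormal" wait row above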

igsilya commented 10 months ago

Hi, @gujun4990. The server shouldn't send the same update twice; I don't know of a case where that could happen.

I did find an issue in the python-idl client code, though, that might cause the client to request the same change again in a rare case, but it requires the client to re-connect twice between sending these two transactions. By re-connect I mean a full re-connection to a new leader with a successful monitor request/reply sequence, i.e. there would have to be two full leadership transfers and two re-connections to the new leaders within a relatively short amount of time.

Do you see something like that in your setup when the issue is happening?

gujun4990 commented 10 months ago

@igsilya, thanks for your reply. I have two questions:

  1. We use ovsdbapp 1.15.0, which contains the following code (ovsdbapp/backend/ovs_idl/transaction.py, do_commit):

            if status == txn.TRY_AGAIN:
                LOG.debug("OVSDB transaction returned TRY_AGAIN, retrying")
                # In the case that there is a reconnection after
                # Connection.run() calls self.idl.run() but before do_commit()
                # is called, commit_block() can loop w/o calling idl.run()
                # which does the reconnect logic. It will then always return
                # TRY_AGAIN until we time out and Connection.run() calls
                # idl.run() again. So, call idl.run() here just in case.
                self.api.idl.run()
    
                # In the event that there is an issue with the txn or the db
                # is down, don't spam new txns as fast as we can
                time.sleep(min(2 ** retries, self.time_remaining(), MAX_SLEEP))
                retries += 1
                continue

    I reviewed ovsdbapp's patches and found a newer patch that drops this code. So I wonder: does this code have a greater chance of causing the problem?

  2. When the python-idl client re-connects to ovsdb-server, is the client's previous in-memory data completely replaced by the new data from ovsdb-server, or does it use a diff function to compare the new and old data?

igsilya commented 10 months ago

  1. Could be. The code in ovsdbapp prior to that commit was clearly wrong, so you should consider updating anyway.
  2. It depends on the version you're using. Assuming you're using python-idl 2.17+, the default monitor type is monitor_cond_since, which receives only the difference between the previous and the new data. Before 2.17, the content of the database would have been overwritten by the reply; see the sketch below.
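
A conceptual sketch of that difference (illustration only, not python-idl internals): with the old plain monitor, a re-connect replaces the cache wholesale, so receiving the same data twice is harmless; with monitor_cond_since, changes are applied on top of the existing cache, so a change applied twice can toggle map entries back out:

    # Conceptual sketch only, not python-idl internals.
    def reconnect_pre_2_17(cache, full_snapshot):
        # Plain "monitor": the reply carries complete rows, so the local cache
        # is simply replaced; a duplicated row in the reply cannot hurt.
        cache.clear()
        cache.update(full_snapshot)
        return cache

    def reconnect_monitor_cond_since(cache, row_diffs):
        # "monitor_cond_since": the reply carries only the changes since the
        # last known transaction id, applied on top of the existing cache.
        # A change applied twice toggles map entries back out, which is the
        # kind of divergence discussed in this issue.
        new_cache = dict(cache)
        for uuid, diff in row_diffs.items():
            row = dict(new_cache.get(uuid, {}))
            for key, value in diff.items():
                if row.get(key) == value:
                    del row[key]
                else:
                    row[key] = value
            new_cache[uuid] = row
        return new_cache
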
igsilya commented 10 months ago

FWIW, I posted a patch for the issue I found that may or may not be related to your issue. In case you want to try it: https://patchwork.ozlabs.org/project/openvswitch/patch/20230909021913.929606-1-i.maximets@ovn.org/

gujun4990 commented 10 months ago

Thanks a lot, I have given it a try. So far the issue has not reproduced again, but I'm not sure that confirms the fix, because the issue is occasional. We hope the community can merge the patch as soon as possible, so that we can keep watching to see whether the issue reproduces again. Thanks again.

gujun4990 commented 10 months ago

@igsilya, I see that the patch you submitted has been merged. Thank you for your patch. I will close the issue.

igsilya commented 10 months ago

Yeah, I applied the change yesterday. I'm still not sure whether that's the issue you're facing, because it requires a double re-connection to cause problems. Feel free to reopen in case you reproduce it again.