scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0

Only add host if endpoint is not already present #304

Closed sylwiaszunejko closed 5 months ago

sylwiaszunejko commented 5 months ago

In d735957, functionality was added to re-resolve hostnames when all hosts are unreachable. In such a scenario, the driver tries to save the situation by re-resolving the contact points in case there was a DNS name change.

However, if there was no address change, this results in the creation of a new Host for every host specified in contact_points. The new Host has a freshly assigned random host_id but the same endpoint. Each duplicate host starts its own reconnection process. Those duplicate reconnection processes can make the situation worse when the driver regains connectivity with the cluster: different Hosts with the same endpoint reconnect at different moments, which causes the host to be unreachable at unpredictable times.

This PR introduces a check for whether the resolved endpoint is already present in the Cluster Metadata. A new host is added only if the endpoint is not already there.

Fixes: #295

Refs: https://github.com/scylladb/scylladb/issues/16709, https://github.com/scylladb/scylladb/issues/17353
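The dedup check described above can be sketched roughly as follows. This is a hedged, self-contained illustration, not the driver's actual code: `Metadata`, `Host`, and `add_host` here are simplified stand-ins for the driver's cluster-metadata structures.

```python
import itertools

class Host:
    # Stand-in for the driver's Host; host_id stands in for the random id
    # that each new Host would otherwise get.
    _ids = itertools.count(1)

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.host_id = next(Host._ids)

class Metadata:
    # Stand-in for the cluster metadata that tracks known hosts.
    def __init__(self):
        self._hosts = {}

    def all_hosts(self):
        return list(self._hosts.values())

    def add_host(self, endpoint):
        host = Host(endpoint)
        self._hosts[endpoint] = host
        return host

def add_resolved_hosts(metadata, resolved_endpoints):
    """Add a Host per re-resolved endpoint, skipping endpoints already known."""
    known = {h.endpoint for h in metadata.all_hosts()}
    added = []
    for endpoint in resolved_endpoints:
        if endpoint in known:
            # Same endpoint after DNS re-resolution: keep the existing Host
            # instead of creating a duplicate that starts its own reconnect.
            continue
        added.append(metadata.add_host(endpoint))
        known.add(endpoint)
    return added
```

With this guard, re-resolving unchanged contact points is a no-op, so no duplicate reconnection processes are started.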

kbr-scylla commented 5 months ago

Fixes: https://github.com/scylladb/python-driver/issues/295, https://github.com/scylladb/scylladb/issues/16709

This syntax doesn't work, each closed issue has to have its own Fixes: line for GitHub automation to work

However, merging this won't fix https://github.com/scylladb/scylladb/issues/16709. To fix https://github.com/scylladb/scylladb/issues/16709 we need to

Also, IIUC this process will also fix https://github.com/scylladb/scylladb/issues/17353?

Also, did you manage to reproduce using the method described in https://github.com/scylladb/scylladb/issues/17353? And verify that it no longer reproduces with this PR?

Also, how did you verify that this fixes https://github.com/scylladb/scylladb/issues/16709? Did you manage to reproduce https://github.com/scylladb/scylladb/issues/16709 and verify that it no longer reproduces after this fix?

avelanarius commented 5 months ago

Fixes: https://github.com/scylladb/python-driver/issues/295, https://github.com/scylladb/scylladb/issues/16709

Please also add references to those issues to the commit message.

avikivity commented 5 months ago

Q (semi-related):

If all hosts were unreachable, then new hosts could have been added or removed and we would not have known about it, as we wouldn't get an event from any connection.

Do we re-read the host list? For that matter, the schema could have changed too.

In fact it can happen without us knowing that we lost connectivity:

  1. connected to a two-node cluster
  2. one connection goes down, we try to re-establish it
  3. while that's in progress, the second server dies, but we don't notice
  4. the cluster adds 19 new nodes and changes 72 different tables' schema
  5. we reconnect to the first server; we don't re-read metadata since we think we're connected to the second server
  6. we notice the connection to the second server is down, and reconnect

So, it's possible to miss events without any indication. We'll need to figure out a way to fix that.

sylwiaszunejko commented 5 months ago

@kbr-scylla

Also, IIUC this process will also fix https://github.com/scylladb/scylladb/issues/17353?

yes

Also, did you manage to reproduce using the method described in https://github.com/scylladb/scylladb/issues/17353? And verify that it no longer reproduces with this PR?

yes

Also, how did you verify that this fixes https://github.com/scylladb/scylladb/issues/16709? Did you manage to reproduce https://github.com/scylladb/scylladb/issues/16709 and verify that it no longer reproduces after this fix?

I reproduced it using test_distributed_count_all from dtests (described in more detail in my comments on this issue); it no longer reproduces after this fix.

mykaul commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

sylwiaszunejko commented 5 months ago

Fixes: #295, scylladb/scylladb#16709

Please also add references to those issues to the commit message.

done

sylwiaszunejko commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Lorak-mmk commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Can you point me to that code? I thought we only did that in control connection.

avikivity commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Can you point me to that code? I thought we only did that in control connection.

It's enough if we do that on every loss of a control connection (provided we keep a control connection to every host, which I don't know if we do).

Lorak-mmk commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Can you point me to that code? I thought we only did that in control connection.

It's enough if we do that on every loss of a control connection (provided we keep a control connection to every host, which I don't know if we do).

There is one control connection per session, not per host, AFAIK

avikivity commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Can you point me to that code? I thought we only did that in control connection.

It's enough if we do that on every loss of a control connection (provided we keep a control connection to every host, which I don't know if we do).

There is one control connection per session, not per host, AFAIK

Hmm. Then if the host we're connected to dies, we don't receive updates.

Do we have keepalives on that connection? TCP or application level?

We need some kind of driver handbook to guide driver maintainers.

sylwiaszunejko commented 5 months ago

Aren't we re-reading the system.peers on every re-connection?

we are

Can you point me to that code? I thought we only did that in control connection.

Maybe I misunderstood something. We do it in the control connection, but through the control connection we do that for every host - https://github.com/scylladb/python-driver/blob/6b82872ef0bbd1fc9cf78fb069a7d7821e3a6dde/cassandra/cluster.py#L3624 here we call _try_connect, and there we re-read system.peers (https://github.com/scylladb/python-driver/blob/6b82872ef0bbd1fc9cf78fb069a7d7821e3a6dde/cassandra/cluster.py#L3663)
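The flow described above can be sketched roughly like this. This is a hypothetical simplification, not the driver's actual code: `_try_connect` corresponds to the real method linked above, while the other names are illustrative.

```python
def reconnect_control_connection(hosts, try_connect, refresh_peers):
    """Try candidate hosts in turn; on a successful (re)connect,
    re-read the peer list (system.peers) through the new connection."""
    for host in hosts:
        conn = try_connect(host)   # corresponds to _try_connect in the driver
        if conn is not None:
            refresh_peers(conn)    # re-reads system.peers on reconnection
            return conn
    return None  # all candidate hosts unreachable
```

The key point is that the system.peers refresh happens whenever the single control connection is (re)established, not per data connection to each host.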

kbr-scylla commented 5 months ago

Fixes: https://github.com/scylladb/python-driver/issues/295

I don't think it actually fixes the issue (which originally reproduced on a multi-node cluster before the DNS resolve regression was introduced): details in this comment https://github.com/scylladb/python-driver/issues/295#issuecomment-1971221129

avelanarius commented 5 months ago

@kbr-scylla OK, I'll reopen #295