scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0
340 stars 175 forks source link

Fix nonreliable method of validating shardawareness in e2e's #2175

Closed zimnx closed 3 weeks ago

zimnx commented 3 weeks ago

Gocql doesn't expose any way of checking if shardaware ports are used successfully, hence I implemented simple driver within a test, able to send initial packet and reading to which shard connection was established. Connections are established from within the client Pod which better reflects how clients are connecting in real environments.

Requires:

Fixes #1028

tnozicka commented 3 weeks ago

Can you elaborate on why https://github.com/scylladb/scylla-operator/blob/027d4b3d44e37272970572e560bcec507e5db814/vendor/github.com/gocql/gocql/scylla.go#L784-L796 is not enough? (failing for 0, or failing to dial)

zimnx commented 3 weeks ago

is not enough? (failing for 0, or failing to dial)

You will dial successfully to shard aware port, but how do you know if you actually got the right shard without parsing the protocol? Gocql driver doesn't allow you to check it. When there's NAT in between client and server, the only side effect is log message which cannot be validated.

Current approach used injected Dialer, but it can't determine if correct shard was chosen so it depended on particular gocql behavior - first connection is to any shard and the rest should use shard aware range. But sometimes driver didn't try to establish connections to all shards within test timeout. Why? I don't know, but relying on it isn't the best idea. We can do better by sending frames on our own and checking how shards are assigned.

zimnx commented 3 weeks ago

E2E's are failing because image is missing netcat. Once it's there I will retrigger them. Local runs with image having it are passing.

rzetelskik commented 3 weeks ago

/lgtm

zimnx commented 3 weeks ago

https://github.com/scylladb/scylla-operator-images/pull/76 landed /retest

zimnx commented 3 weeks ago

Bumped node-setup image to 0.0.3

zimnx commented 3 weeks ago

One minor change was made to remove -x from client Pod to prevent printing commands to stderr which is later validated and expected to be empty. There's no benefit of having this print. Expanded command is available in e2e dump. @rzetelskik ptal

scylla-operator-bot[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/scylladb/scylla-operator/blob/master/OWNERS)~~ [tnozicka,zimnx] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment