Open jsanda opened 4 years ago
Edit: Oops, maybe it is related to #870 instead of this.
Adding my two cents on this issue...
Before the rollout restart of the StatefulSet:
UN 192.168.133.151 599.77 KiB 16 77.4% 91ae03de-e968-40ea-b640-f054a7a90b5c Rack01
UN 192.168.99.16 549.8 KiB 16 74.7% 5e668d33-bbcc-43f8-814f-2244a51a89a5 Rack01
UN 192.168.37.47 630.54 KiB 16 73.2% 31c7d3fb-e12f-40e4-b208-4fa9e046158f Rack01
UN 192.168.111.88 521.82 KiB 16 74.7% 709a5a8a-7004-4a1a-b857-bba2e2362ffa Rack01
cassandra-dc01-0 1/1 Running 0 64m 192.168.133.151 dc01-krts03 <none> <none>
cassandra-dc01-1 1/1 Running 0 65m 192.168.99.16 dc01-krts01 <none> <none>
cassandra-dc01-2 1/1 Running 0 60s 192.168.111.88 dc01-krts04 <none> <none>
cassandra-dc01-3 1/1 Running 0 2m7s 192.168.37.47 dc01-krts02 <none> <none>
Reaper started a repair... and a rollout restart of the StatefulSet changed the IPs:
UN 192.168.133.153 599.77 KiB 16 77.4% 91ae03de-e968-40ea-b640-f054a7a90b5c Rack01
UN 192.168.99.18 549.8 KiB 16 74.7% 5e668d33-bbcc-43f8-814f-2244a51a89a5 Rack01
UN 192.168.37.48 630.54 KiB 16 73.2% 31c7d3fb-e12f-40e4-b208-4fa9e046158f Rack01
UN 192.168.111.89 521.82 KiB 16 74.7% 709a5a8a-7004-4a1a-b857-bba2e2362ffa Rack01
cassandra-dc01-0 1/1 Running 0 17m 192.168.133.153 dc01-krts03 <none> <none>
cassandra-dc01-1 1/1 Running 0 18m 192.168.99.18 dc01-krts01 <none> <none>
cassandra-dc01-2 1/1 Running 0 19m 192.168.111.89 dc01-krts04 <none> <none>
cassandra-dc01-3 1/1 Running 0 20m 192.168.37.48 dc01-krts02 <none> <none>
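Note that the host IDs are identical in both listings, so the old-to-new IP mapping can be derived by joining the two `nodetool status` outputs on host ID. A minimal Python sketch of that idea; the function names are hypothetical, and the parsing assumes the column layout shown above (host ID as the second-to-last field):

```python
BEFORE = """
UN 192.168.133.151 599.77 KiB 16 77.4% 91ae03de-e968-40ea-b640-f054a7a90b5c Rack01
UN 192.168.99.16 549.8 KiB 16 74.7% 5e668d33-bbcc-43f8-814f-2244a51a89a5 Rack01
UN 192.168.37.47 630.54 KiB 16 73.2% 31c7d3fb-e12f-40e4-b208-4fa9e046158f Rack01
UN 192.168.111.88 521.82 KiB 16 74.7% 709a5a8a-7004-4a1a-b857-bba2e2362ffa Rack01
"""

AFTER = """
UN 192.168.133.153 599.77 KiB 16 77.4% 91ae03de-e968-40ea-b640-f054a7a90b5c Rack01
UN 192.168.99.18 549.8 KiB 16 74.7% 5e668d33-bbcc-43f8-814f-2244a51a89a5 Rack01
UN 192.168.37.48 630.54 KiB 16 73.2% 31c7d3fb-e12f-40e4-b208-4fa9e046158f Rack01
UN 192.168.111.89 521.82 KiB 16 74.7% 709a5a8a-7004-4a1a-b857-bba2e2362ffa Rack01
"""


def parse_status(text):
    """Return {host_id: ip} for the node rows of a `nodetool status` listing."""
    hosts = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Node rows start with a status/state pair such as UN or DN;
        # header and separator lines are skipped.
        if len(fields) < 8 or len(fields[0]) != 2:
            continue
        if fields[0][0] not in "UD" or fields[0][1] not in "NLJM":
            continue
        hosts[fields[-2]] = fields[1]
    return hosts


def ip_changes(before, after):
    """Return {old_ip: new_ip} for nodes whose address changed."""
    old, new = parse_status(before), parse_status(after)
    return {old[h]: new[h] for h in old if h in new and old[h] != new[h]}


if __name__ == "__main__":
    for old_ip, new_ip in ip_changes(BEFORE, AFTER).items():
        print(old_ip, "->", new_ip)
```

Joining on host ID avoids guessing which new address belongs to which node, since the IDs survive the restart while the IPs do not.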
So the IPs changed. Update the cluster table in the Reaper database:
> select name, seed_hosts from cassandra_reaper.cluster where name='cluster';
name | seed_hosts
---------+-------------------------------------------------------------------------
cluster | {'192.168.111.87', '192.168.133.151', '192.168.37.46', '192.168.99.16'}
(1 rows)
> update cassandra_reaper.cluster set seed_hosts={'192.168.111.89', '192.168.133.153', '192.168.37.48', '192.168.99.18'} where name='cluster';
> select name, seed_hosts from cassandra_reaper.cluster where name='cluster';
name | seed_hosts
---------+-------------------------------------------------------------------------
cluster | {'192.168.111.89', '192.168.133.153', '192.168.37.48', '192.168.99.18'}
(1 rows)
Then update all affected repair_run rows. I updated only one row in this example:
> select segment_id, replicas from cassandra_reaper.repair_run where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 limit 1;
segment_id | replicas
--------------------------------------+------------------------------------------------------------------------------
0ef8de90-c08f-11ec-88a8-45b15b5d8dc9 | {'192.168.111.87': 'DC01', '192.168.37.46': 'DC01', '192.168.99.16': 'DC01'}
(1 rows)
> update cassandra_reaper.repair_run set replicas={'192.168.111.89': 'DC01', '192.168.37.48': 'DC01', '192.168.99.18': 'DC01'} where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 and segment_id=0ef8de90-c08f-11ec-88a8-45b15b5d8dc9;
> select segment_id, replicas from cassandra_reaper.repair_run where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 limit 1;
segment_id | replicas
--------------------------------------+------------------------------------------------------------------------------
0ef8de90-c08f-11ec-88a8-45b15b5d8dc9 | {'192.168.111.89': 'DC01', '192.168.37.48': 'DC01', '192.168.99.18': 'DC01'}
(1 rows)
...and everything is back to normal. I created a little script to fix it on my K8s cluster.
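The script itself is not included here. As a rough sketch of what such a fix-up could look like (not the actual script), the Python below rewrites a `replicas` map with an old-to-new IP mapping and renders the same kind of `update` statement as above. The helper names are hypothetical, and the statement is only built as a string; running it against the `cassandra_reaper` keyspace is left to cqlsh or a driver.

```python
# Old-to-new addresses, taken from the seed_hosts and replicas values above.
IP_MAP = {
    "192.168.111.87": "192.168.111.89",
    "192.168.133.151": "192.168.133.153",
    "192.168.37.46": "192.168.37.48",
    "192.168.99.16": "192.168.99.18",
}


def remap_replicas(replicas, ip_map):
    """Replace stale IPs in a {ip: datacenter} map; unknown IPs are kept as-is."""
    return {ip_map.get(ip, ip): dc for ip, dc in replicas.items()}


def cql_map(mapping):
    """Render a dict as a CQL map<text, text> literal with sorted keys."""
    items = ", ".join(f"'{k}': '{v}'" for k, v in sorted(mapping.items()))
    return "{" + items + "}"


def update_statement(run_id, segment_id, replicas):
    """Build the UPDATE for one repair_run segment."""
    return (
        f"update cassandra_reaper.repair_run set replicas={cql_map(replicas)} "
        f"where id={run_id} and segment_id={segment_id};"
    )


if __name__ == "__main__":
    stale = {
        "192.168.111.87": "DC01",
        "192.168.37.46": "DC01",
        "192.168.99.16": "DC01",
    }
    print(update_statement(
        "0ef89070-c08f-11ec-88a8-45b15b5d8dc9",
        "0ef8de90-c08f-11ec-88a8-45b15b5d8dc9",
        remap_replicas(stale, IP_MAP),
    ))
```

To cover a whole repair run, the same function would be applied to each segment returned by the `select` on `repair_run`.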
This stems from some discussion in #813.
In Kubernetes, Cassandra is typically deployed as a StatefulSet. The StatefulSet controller provides stable network IDs for the pods in a StatefulSet. For example, the C* pod names in a StatefulSet might look like this:
Those pod names are also the hostnames. Reaper reports IP addresses and not hostnames. This makes debugging a bit more difficult as you have to map the IP address to the pod hostname.
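Until hostnames are reported directly, the IP-to-pod mapping can be done mechanically rather than by eye. A small Python sketch, assuming the whitespace-separated row format of `kubectl get pods -o wide` pasted elsewhere in this thread; the function names are hypothetical, and the sample rows are copied from that output:

```python
import ipaddress

PODS = """
cassandra-dc01-0 1/1 Running 0 17m 192.168.133.153 dc01-krts03 <none> <none>
cassandra-dc01-1 1/1 Running 0 18m 192.168.99.18 dc01-krts01 <none> <none>
cassandra-dc01-2 1/1 Running 0 19m 192.168.111.89 dc01-krts04 <none> <none>
cassandra-dc01-3 1/1 Running 0 20m 192.168.37.48 dc01-krts02 <none> <none>
"""


def _is_ip(token):
    """True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def pods_by_ip(text):
    """Return {pod_ip: pod_name} from `kubectl get pods -o wide` rows."""
    pods = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Locate the IP by parsing rather than by column position, since
        # the RESTARTS column may contain extra tokens in some versions.
        ip = next((f for f in fields[1:] if _is_ip(f)), None)
        if ip:
            pods[ip] = fields[0]
    return pods


if __name__ == "__main__":
    lookup = pods_by_ip(PODS)
    # Translate an address reported by Reaper into its pod hostname.
    print(lookup.get("192.168.111.89", "unknown"))
```

This only reflects the pod IPs at the moment the listing was taken, so it needs to be regenerated after any restart.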
I am not sure whether or not this is something that can be addressed in Reaper. As I mentioned in #813, in some of my testing Cassandra reports the IP addresses:
Lastly, I want to mention that even if the necessary changes have to happen outside of Reaper, such as in the pod configuration, we can at least capture and document them here.