thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Investigate having Reaper display hostnames instead of IP addresses in Kubernetes deployments #871

Open jsanda opened 4 years ago

jsanda commented 4 years ago


This stems from some discussion in #813.

In Kubernetes, Cassandra is typically deployed as a StatefulSet. The StatefulSet controller provides stable network identities for the pods in a StatefulSet. For example, the C* pod names in a StatefulSet might look like this:
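For instance, with the StatefulSet from the nodetool example below, the `<statefulset-name>-<ordinal>` naming convention yields:

```
cluster-2-dc1-rack1-0
cluster-2-dc1-rack1-1
cluster-2-dc1-rack1-2
```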

Those pod names are also the pods' hostnames. Reaper, however, reports IP addresses rather than hostnames, which makes debugging a bit more difficult: you have to map each IP address back to a pod hostname.
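One way to do that mapping is to join `nodetool status` output against `kubectl get pods -o wide`, which prints pod name and pod IP side by side. A minimal sketch (the sample output is hypothetical; in practice you would capture it with `subprocess.run(["kubectl", "get", "pods", "-o", "wide"], ...)`):

```python
# Sketch: build an IP -> pod-name map from `kubectl get pods -o wide` output.
# The sample below is hand-written for illustration, not captured from a cluster.
KUBECTL_WIDE_OUTPUT = """\
NAME                    READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
cluster-2-dc1-rack1-0   1/1     Running   0          64m   10.16.1.3    node1   <none>           <none>
cluster-2-dc1-rack1-1   1/1     Running   0          65m   10.16.0.6    node2   <none>           <none>
cluster-2-dc1-rack1-2   1/1     Running   0          60s   10.16.1.10   node3   <none>           <none>
"""

def ip_to_pod(kubectl_output: str) -> dict[str, str]:
    """Map each pod IP to its pod name, skipping the header line."""
    mapping = {}
    for line in kubectl_output.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 6:
            name, ip = fields[0], fields[5]  # NAME is column 0, IP is column 5
            mapping[ip] = name
    return mapping

print(ip_to_pod(KUBECTL_WIDE_OUTPUT)["10.16.1.3"])  # prints "cluster-2-dc1-rack1-0"
```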

I am not sure whether this is something that can be addressed in Reaper. As I mentioned in #813, in some of my testing Cassandra itself reports the IP addresses:

$ kubectl exec -it cluster-2-dc1-rack1-2 -- nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.16.1.3   289.83 KiB  256          66.0%             bb82e5a5-007b-4638-92c3-61f6abb52d6a  rack1
UN  10.16.0.6   277.21 KiB  256          67.1%             94555840-ab60-4478-a871-de8f40db1e66  rack1
UN  10.16.1.10  271.41 KiB  256          66.8%             6c603af3-30d2-44d5-b376-f4b57969ed11  rack1

Lastly, I want to mention that even if the necessary changes have to happen outside of Reaper, e.g. in the pod configuration, we can at least capture and document them here.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-138

gaborauth commented 2 years ago

Edit: Oops, maybe it is related to #870 instead of this.

To put my two cents in about the issue...

Before a rollout restart of the StatefulSet:

UN  192.168.133.151  599.77 KiB  16      77.4%             91ae03de-e968-40ea-b640-f054a7a90b5c  Rack01
UN  192.168.99.16    549.8 KiB   16      74.7%             5e668d33-bbcc-43f8-814f-2244a51a89a5  Rack01
UN  192.168.37.47    630.54 KiB  16      73.2%             31c7d3fb-e12f-40e4-b208-4fa9e046158f  Rack01
UN  192.168.111.88   521.82 KiB  16      74.7%             709a5a8a-7004-4a1a-b857-bba2e2362ffa  Rack01

cassandra-dc01-0     1/1     Running       0          64m     192.168.133.151   dc01-krts03   <none>           <none>
cassandra-dc01-1     1/1     Running       0          65m     192.168.99.16     dc01-krts01   <none>           <none>
cassandra-dc01-2     1/1     Running       0          60s     192.168.111.88    dc01-krts04   <none>           <none>
cassandra-dc01-3     1/1     Running       0          2m7s    192.168.37.47     dc01-krts02   <none>           <none>

Reaper started a repair... and then a rollout restart of the StatefulSet changed the IPs:

UN  192.168.133.153  599.77 KiB  16      77.4%             91ae03de-e968-40ea-b640-f054a7a90b5c  Rack01
UN  192.168.99.18    549.8 KiB   16      74.7%             5e668d33-bbcc-43f8-814f-2244a51a89a5  Rack01
UN  192.168.37.48    630.54 KiB  16      73.2%             31c7d3fb-e12f-40e4-b208-4fa9e046158f  Rack01
UN  192.168.111.89   521.82 KiB  16      74.7%             709a5a8a-7004-4a1a-b857-bba2e2362ffa  Rack01

cassandra-dc01-0     1/1     Running   0          17m     192.168.133.153   dc01-krts03   <none>           <none>
cassandra-dc01-1     1/1     Running   0          18m     192.168.99.18     dc01-krts01   <none>           <none>
cassandra-dc01-2     1/1     Running   0          19m     192.168.111.89    dc01-krts04   <none>           <none>
cassandra-dc01-3     1/1     Running   0          20m     192.168.37.48     dc01-krts02   <none>           <none>

So, the IPs changed; first, update the cluster table:

> select name, seed_hosts from cassandra_reaper.cluster where name='cluster';

 name    | seed_hosts
---------+-------------------------------------------------------------------------
 cluster | {'192.168.111.87', '192.168.133.151', '192.168.37.46', '192.168.99.16'}

(1 rows)
> update cassandra_reaper.cluster set seed_hosts={'192.168.111.89', '192.168.133.153', '192.168.37.48', '192.168.99.18'} where name='cluster';
> select name, seed_hosts from cassandra_reaper.cluster where name='cluster';

 name    | seed_hosts
---------+-------------------------------------------------------------------------
 cluster | {'192.168.111.89', '192.168.133.153', '192.168.37.48', '192.168.99.18'}

(1 rows)

Then update all affected repair_run rows (in this example I updated only one):

> select segment_id, replicas from cassandra_reaper.repair_run where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 limit 1;

 segment_id                           | replicas
--------------------------------------+------------------------------------------------------------------------------
 0ef8de90-c08f-11ec-88a8-45b15b5d8dc9 | {'192.168.111.87': 'DC01', '192.168.37.46': 'DC01', '192.168.99.16': 'DC01'}

(1 rows)
> update cassandra_reaper.repair_run set replicas={'192.168.111.89': 'DC01', '192.168.37.48': 'DC01', '192.168.99.18': 'DC01'} where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 and segment_id=0ef8de90-c08f-11ec-88a8-45b15b5d8dc9;
> select segment_id, replicas from cassandra_reaper.repair_run where id=0ef89070-c08f-11ec-88a8-45b15b5d8dc9 limit 1;

 segment_id                           | replicas
--------------------------------------+------------------------------------------------------------------------------
 0ef8de90-c08f-11ec-88a8-45b15b5d8dc9 | {'192.168.111.89': 'DC01', '192.168.37.48': 'DC01', '192.168.99.18': 'DC01'}

(1 rows)

...and everything is back to normal. I created a little script to fix it on my K8s cluster.
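The script itself isn't included in the comment. A minimal sketch of what such a fix-up could look like, assuming you already know the old-to-new IP mapping (all names and IPs below are hypothetical), generating the CQL UPDATE statements shown above:

```python
# Sketch: emit the CQL to swap old pod IPs for new ones in Reaper's backend,
# along the lines of the manual updates above. The mapping, cluster name, and
# UUIDs are hypothetical; feed the emitted CQL to cqlsh after a rollout restart.
IP_MAP = {
    "192.168.111.87": "192.168.111.89",
    "192.168.133.151": "192.168.133.153",
    "192.168.37.46": "192.168.37.48",
    "192.168.99.16": "192.168.99.18",
}

def remap_seed_hosts(cluster: str, old_seeds: set[str]) -> str:
    """Emit an UPDATE replacing every remapped IP in cluster.seed_hosts."""
    new_seeds = sorted(IP_MAP.get(ip, ip) for ip in old_seeds)
    seed_literal = ", ".join(f"'{ip}'" for ip in new_seeds)
    return (f"UPDATE cassandra_reaper.cluster SET seed_hosts = {{{seed_literal}}} "
            f"WHERE name = '{cluster}';")

def remap_replicas(run_id: str, segment_id: str, replicas: dict[str, str]) -> str:
    """Emit an UPDATE for one repair_run row's replica map (IP -> datacenter)."""
    entries = ", ".join(f"'{IP_MAP.get(ip, ip)}': '{dc}'"
                        for ip, dc in sorted(replicas.items()))
    return (f"UPDATE cassandra_reaper.repair_run SET replicas = {{{entries}}} "
            f"WHERE id = {run_id} AND segment_id = {segment_id};")

print(remap_seed_hosts("cluster", {"192.168.111.87", "192.168.99.16"}))
```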