thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Reaper 0.6.1 Can't connect to Cassandra seed node #139

Closed: elsmorian closed this issue 7 years ago

elsmorian commented 7 years ago

I'm running Reaper 0.6.1 from a Docker container, but when I add a cluster from the web interface it fails with "Failed to establish JMX connection to 172.29.16.15:7199". Further down the trace it mentions "Caused by: java.rmi.ConnectException: Connection refused to host: 127.0.1.1; nested exception is:".

I have checked connectivity from the Docker host to my Cassandra cluster and can telnet to the seed IP I'm giving it on port 7199 with no problem. I'm not sure why it's trying to connect to 127.0.1.1 - have I missed some configuration somewhere?

joaquincasares commented 7 years ago

Hello @elsmorian ,

Perhaps Cassandra is in LOCAL_JMX mode?
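For context: when LOCAL_JMX is left at its default of yes, the stock cassandra-env.sh only exposes JMX on localhost, which would explain a remote connection being refused. A paraphrased sketch (the exact options vary by Cassandra version):

if [ "$LOCAL_JMX" = "yes" ]; then
    # JMX reachable from the local host only
    JVM_OPTS="$JVM_OPTS -Dcassandra.jmx.local.port=$JMX_PORT"
else
    # remote JMX, which requires the jmxremote.password / jmxremote.access files described below
    JVM_OPTS="$JVM_OPTS -Dcassandra.jmx.remote.port=$JMX_PORT"
    JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
fi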

In this sample Docker Compose setup, which should be committed in the next couple of days, if not today, I had to mount these two files:

https://github.com/thelastpickle/cassandra-reaper/pull/136/files#diff-4e5e90c6228fd48698d074241c2ba760R16:

./docker/cassandra/jmxremote.access:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/management/jmxremote.access
./docker/cassandra/jmxremote.password:/etc/cassandra/jmxremote.password

The contents of both of those files can be found here:

https://github.com/thelastpickle/cassandra-reaper/tree/129c2f9a6676a74fb91adbfd95ab81c72a8e1b16/docker/cassandra
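As a rough sketch, those two mounts end up in the Compose file looking something like this (the service name cassandra is an assumption; the paths are the ones listed above):

cassandra:
  volumes:
    # JMX access roles file, mounted into the JRE's management directory
    - ./docker/cassandra/jmxremote.access:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/management/jmxremote.access
    # JMX password file, mounted into the Cassandra config directory
    - ./docker/cassandra/jmxremote.password:/etc/cassandra/jmxremote.password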

I also had to set this:

https://github.com/thelastpickle/cassandra-reaper/pull/136/files#diff-de6869c9003123a8827ef8b2df84b0e6R24

LOCAL_JMX=no
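In Compose terms that could look like the following (a sketch; the exact placement in the linked file may differ):

cassandra:
  environment:
    # picked up by cassandra-env.sh so that JMX is exposed beyond localhost
    LOCAL_JMX: "no"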

You can read more on why those settings were needed here:

https://github.com/thelastpickle/cassandra-reaper/blob/129c2f9a6676a74fb91adbfd95ab81c72a8e1b16/docker/README.md#cassandra

Do let us know if that still doesn't fix your issue. Cheers!

elsmorian commented 7 years ago

Hi @joaquincasares, thanks for the very helpful reply! I had already done the LOCAL_JMX=no part, but I had not added the Reaper user to the jmxremote.access file. Thanks for the information!
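For anyone following along, both JMX files use the standard Java formats; a minimal sketch (reaperUser / reaperPass are placeholders, use whatever credentials Reaper is configured with):

# jmxremote.access - one "user accesslevel" entry per line
reaperUser readwrite

# jmxremote.password - one "user password" entry per line; the JVM requires it to be readable only by the Cassandra user (chmod 400)
reaperUser reaperPass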

Now we are up and running. Just a quick question: we are running Cassandra 2.2.8 with the default number of vnodes (256?), so what number is a good starting point for the Segment Count for our first repairs via Reaper?

joaquincasares commented 7 years ago

@elsmorian , that's great to hear! :)

A quick note: We've been recommending 32 vnodes lately to some of our larger customers in an attempt to simplify internal Cassandra operations while still allowing for the benefits that vnodes provide. However, do note that you cannot change the number of vnodes on a running data center. Instead, if you were experiencing issues with 256 vnodes per node, you'd want to spin up a new data center with fewer vnodes, migrate to the new data center, then decommission the old data center with 256 vnodes.

As far as Segment Counts, I believe we would need a bit more info, such as the number of nodes, replication factor, and data size per node.

elsmorian commented 7 years ago

@joaquincasares Many thanks for your response. What sort of issues would you attribute to having 256 vnodes?

Currently we are running a slightly odd setup: 7 nodes with RF=3 in one 'production' datacenter, and a single node in another datacenter we use for testing, etc. Data load in the first DC is between 612.8 GB and 766.62 GB, and the single node in the other DC has a load of 1.32 TB.

joaquincasares commented 7 years ago

@elsmorian Sure thing! :)

256 vnodes will add a bit more complexity to all streaming-related tasks, since you now have to handle 256 * num_of_nodes different streams.

Thanks for the info! @adejanovski / @michaelsembwever do you have a recommendation for his ideal Segment Count based on the above numbers?

adejanovski commented 7 years ago

The number of segments will vary for each cluster. The best starting point IMHO is to have one segment per vnode in the cluster (1792 in your case, i.e. 7 nodes * 256 vnodes), since that's the lowest number you'll be able to run: a single segment can never span multiple vnodes. Set a higher value only if segments fail to be repaired within 30 minutes.

You can easily see why having that many vnodes forces you to have a lot of segments, each of which carries some overhead. For example, there's the pause time related to intensity (which by default is at least 30s); multiplied by the number of segments, that gives at least 15h. 7 nodes with an RF of 3 allow 2 segments to be repaired at the same time, bringing the overall overhead down to about 7h30. Using 32 vnodes would bring that down to roughly 1h50.
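Spelling out that arithmetic with the numbers above:

1792 segments * 30 s pause        = 53,760 s, roughly 15 h of pause overhead
with 2 segments repaired in parallel: 15 h / 2, roughly 7 h 30
with 32 vnodes: 7 nodes * 32 = 224 segments * 30 s = 6,720 s, roughly 1 h 50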

By the way, the minimum pause time (which is the repair manager loop cycle) can be modified in the yaml config file: repairManagerSchedulingIntervalSeconds: 30

If segments are very fast to process, it could be worth reducing the value to 10s, for example.
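For example, in the Reaper yaml config file mentioned above (an excerpt only; all other settings omitted):

# shorten the repair manager loop when segments complete quickly
repairManagerSchedulingIntervalSeconds: 10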

Closing the issue, and do not hesitate to discuss this on the Reaper ML.

elsmorian commented 7 years ago

Thank you both @joaquincasares @adejanovski for some excellent advice here, really helpful!