Too many compactions and reads on one of the scylla nodes in a cluster compared to others.

gaurarpit12 commented 3 years ago

This is Scylla's bug tracker, to be used for reporting bugs only. If you have a question about Scylla, and not a bug, please ask it in our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

[] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): **3.0.6**
Cluster size: 5 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS 7

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): Physical
Hardware: sockets=2 cores=28 hyperthreading=Enabled memory=403GB
Disks: (SSD/HDD, count) SSD, RAID1

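There are several issues with one node in the cluster:

Too many compactions are happening on this node. The following image shows a very high compaction queue length:

[image: image] https://user-images.githubusercontent.com/20061410/103285856-333f2400-4a05-11eb-9669-106c9d11fafd.png

The number of foreground and background reads is much higher on this node:

[image: image] https://user-images.githubusercontent.com/20061410/103286009-8c0ebc80-4a05-11eb-9033-0a9765c8c4cf.png

Read latency is much higher on this node than on the other nodes:

[image: image] https://user-images.githubusercontent.com/20061410/103286505-430b3800-4a06-11eb-8cfe-bbb9a2955ed8.png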

We saw these issues earlier too and reduced our data, but this time it is getting very painful. On the node itself, errors like the following are logged continuously:
storage_proxy - Exception when communicating with 10.4.106.23: seastar::semaphore_timed_out (Semaphore timedout)

I suspect this is a bug in Scylla 3.0.x, hence I am reporting it here.
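To see which peers the timeouts point at, the error quoted above can be counted per address. The following is a minimal sketch only: the regular expression is derived from the single log line in this report, other Scylla versions may format the message differently, and the function name `count_timeouts_by_peer` is made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the one log line quoted in this report (an assumption:
# other versions or log configurations may format the message differently).
TIMEOUT_RE = re.compile(
    r"storage_proxy - Exception when communicating with (\d+\.\d+\.\d+\.\d+): "
    r"seastar::semaphore_timed_out"
)


def count_timeouts_by_peer(lines):
    """Count semaphore_timed_out errors per peer address in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Any iterable of log lines works as input, for example an open log file. If most errors name the same peer, that supports the observation that a single node is struggling.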

dorlaor commented 3 years ago

The IO queue length isn't the number of compactions but the length of the queue for compaction, although the two are related. From the pattern of latency and tasks, it looks like another process is stealing CPU or IO from Scylla. Are you running in a container, or do you have other processes running on the machine? Scylla expects to be the only one on the cores it uses. Another possible reason is that the IO wasn't tuned with scylla_setup. Either way, 3.0 is old and not supported; please upgrade to 4.1 or 4.2.


gaurarpit12 commented 3 years ago

Scylla is the only service running on the machine.

gaurarpit12 commented 3 years ago

scylla_setup was run as soon as scylla was installed.

slivne commented 3 years ago

@gaurarpit12 you are using Scylla 3.0.6. We only support the last two Scylla open source releases, which currently are 4.1 and 4.2 (4.3 is going out soon).

With regard to this being a bug: you may be right. We have fixed many bugs since then, and given the time that has passed it's hard for us to say which bug it may be.

If you share a bit more about the use case, such as which compaction strategy you are using, it may help.

Please note that in 3.0 (if I recall correctly) compaction was made more aggressive in STCS - you can read about that here https://docs.scylladb.com/getting-started/compaction/#stcs-options

So if you had different settings between the nodes in the yaml, that may be the source (a wild guess, it may not be true).

Thanks for reporting the issue. If you do hit the same issue after upgrading to a supported open source release, we will work to debug and solve it with you.

gaurarpit12 commented 3 years ago

The upgrade process from 3.0 to the latest version is quite painful, as it involves too many steps. Isn't there a one-step process?

slivne commented 3 years ago

For a rolling upgrade, where traffic is ongoing, we only test and verify upgrades from one major release to the next:

3.0 --> 3.1 --> 3.2 --> 3.3 --> 4.0 --> 4.1 --> 4.2
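The chain above can be turned into a small helper that lists the hops needed between two versions. This is only a sketch: the release list is copied from this comment, and the name `upgrade_hops` is made up for illustration and is not part of any Scylla tooling.

```python
# Release chain as listed in this thread (assumption: no other releases in between).
RELEASE_CHAIN = ["3.0", "3.1", "3.2", "3.3", "4.0", "4.1", "4.2"]


def upgrade_hops(current, target, chain=RELEASE_CHAIN):
    """Return the (from, to) rolling-upgrade steps needed to go from current to target."""
    if current not in chain or target not in chain:
        raise ValueError("version not in the known release chain")
    start, end = chain.index(current), chain.index(target)
    if start > end:
        raise ValueError("downgrades are not covered by this sketch")
    # Pair each release with its successor: one rolling upgrade per pair.
    return list(zip(chain[start:end], chain[start + 1:end + 1]))


if __name__ == "__main__":
    for src, dst in upgrade_hops("3.0", "4.2"):
        print(f"rolling upgrade {src} --> {dst}")
```

Going from 3.0 to 4.2 this way means six separate rolling upgrades, each of which has to complete on all nodes before the next one starts.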

If you can have downtime, you can try:

take a backup
shut down scylla on all nodes
upgrade all the nodes
start the nodes (starting from the seeds)
only once all nodes are up, restore traffic

If you do decide to try this out, please test it on the side first. We do not have tests for this, but theoretically it should work (and if it doesn't, you will be able to restore the old version and restore the backup).
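The "starting from the seeds" step can be sketched as a simple ordering of the node list. This is an illustration only: the node names are hypothetical, and `start_order` is not an existing tool, just a helper for planning the restart sequence.

```python
def start_order(nodes, seeds):
    """Order nodes for startup: seeds first, then the rest, keeping input order."""
    seed_set = set(seeds)
    unknown = seed_set - set(nodes)
    if unknown:
        raise ValueError(f"seeds not in node list: {sorted(unknown)}")
    first = [n for n in nodes if n in seed_set]
    rest = [n for n in nodes if n not in seed_set]
    return first + rest


if __name__ == "__main__":
    # Hypothetical 5-node cluster with two seed nodes.
    cluster = ["node1", "node2", "node3", "node4", "node5"]
    for name in start_order(cluster, seeds=["node3", "node1"]):
        print("start", name)
```

Traffic should only be restored after every node in the returned list is up.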

gaurarpit12 commented 3 years ago

Hi @slivne, the rolling upgrade would require me to download Scylla 3.1 and 3.2 as well. However, these versions are not available for download on your site. How do I move forward with the upgrade then?

avikivity commented 3 years ago

/cc @tzach

gaurarpit12 commented 3 years ago

I added a new node to the cluster 4 days ago. The issue that was seen on the previous node has now moved to this new node. Read and write timeouts shoot up every now and then, and we are seeing a lot of read and write errors. The read latency is approximately 2 seconds, which is very high. The upgrade has to be done as soon as possible, but versions 3.1 and 3.2 are unavailable for download. What should I do now? How do I upgrade the cluster?

slivne commented 3 years ago

@gaurarpit12 although they are not in the drop-down list, you can still access the instructions and artifacts.

It's a bit hacky, but it works: change the link at the top of the page to have scylla-3.1 / scylla-3.2 and you will find what you are looking for.

scylladb / scylladb

Too many compactions and reads on one of the scylla nodes in a cluster compared to others. #7847