Open lucowehrlin opened 1 week ago
Twice now, Scylla has performed a large number of very small flushes, which triggered many small compactions and made writes very slow.
Since #20991 does not seem to be the root cause, the flushes may be memory related: we see a lot of allocation failures in our logs.
@ptrsmrn - the allocations above seem to be in CQL territory - can you take a look?
FYI: We are often times reading/writing blobs up to 3MB in size.
Question: can this lead to some kind of low-memory situation in which Scylla 6.0 flushes constantly?
CQL allocates "oversized allocation: 2768896 bytes" (about 2.64 MiB), which is roughly what @horschi reported: "FYI: We are often times reading/writing blobs up to 3MB in size."
@bhalevy are you aware of what can cause so many mini flushes? Also, the 2nd stack trace mentions row_cache and an LSA allocation failure. That is a hint, but I am not familiar with this area.
What's the underlying counter for "Memtable switches"? I can't find it in scylla-monitoring.git.
Please upload a snapshot of your monitoring database and indicate a time period to look at; maybe we can find a clue there.
> What's the underlying counter for "Memtable switches"? I can't find it in scylla-monitoring.git.
rate(scylla_column_family_memtable_switch[1m])
> Please upload a snapshot of your monitoring database and indicate a time period to look at; maybe we can find a clue there.
I will see that we provide that.
Close up on the time when we restarted:
On a per CF basis:
(rate(scylla_column_family_memtable_switch[1m]))
It shows a couple of hosts in a flush storm; after the restart the numbers go down. The flushes affect all active tables.
Per host basis:
sum by (instance) (rate(scylla_column_family_memtable_switch[1m]))
Edit: added second screenshot
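For anyone who wants to pull the same per-host numbers without the dashboards, here is a minimal sketch in Python. It only builds a request URL for Prometheus' `/api/v1/query_range` endpoint and summarizes a response in the standard matrix format. The base URL, the function names, and the threshold are assumptions for illustration and are not part of scylla-monitoring.

```python
from urllib.parse import urlencode

# Per-host query taken from the comment above.
PER_HOST_QUERY = "sum by (instance) (rate(scylla_column_family_memtable_switch[1m]))"


def build_query_range_url(base_url, query, start, end, step="60s"):
    """Build a URL for Prometheus' /api/v1/query_range endpoint.

    base_url is an assumption, e.g. the Prometheus instance that
    backs your monitoring stack. start and end are Unix timestamps.
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def peak_rate_per_instance(response):
    """Return {instance: highest sample value} from a matrix response."""
    peaks = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        values = [float(value) for _, value in series["values"]]
        peaks[instance] = max(values) if values else 0.0
    return peaks


def hosts_above(peaks, threshold):
    """List instances whose peak switch rate exceeds an arbitrary threshold."""
    return sorted(host for host, peak in peaks.items() if peak > threshold)
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the JSON body gives the `response` dictionary expected by `peak_rate_per_instance`. The threshold passed to `hosts_above` has to be chosen by comparing against a healthy node, since a "normal" switch rate depends on the workload.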
Still worried about our findings here: https://forum.scylladb.com/t/compaction-storm-slows-down-scylla/2958/24
While one observation seems to have resulted in a fixable bug (https://github.com/scylladb/scylladb/issues/20991), we still do not understand what caused the "storm" of mini flushes that slowed down the database heavily.
One observation was that we saw multiple memory issues during the "storm": first an oversized allocation, and shortly thereafter multiple LSA-related errors, like the one below.
The other memory related issue was this: