Open vponomaryov opened 22 hours ago
The monitoring node is still alive: https://eu-west-1.console.aws.amazon.com/ec2/home?region=eu-west-1#InstanceDetails:instanceId=i-022788c782a7a759c
@fruch, @roydahan this ^ is the same problem observed here: enterprise-2024.2/reproducers/scale-5000-tables-test#3
@vponomaryov does the monitoring server meet the memory space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement
@amnonh
I've found that we have some table-specific metrics, like `scylla_column_family_memtable_row_hits`.
After a quick tour of the Scylla code, I found the flag that enables them:
https://github.com/scylladb/scylladb/blob/acd643bd75468703150b2e23b1bbf05a3e95e42d/db/config.cc#L1012
and it is on by default.
Is that on purpose?
Yes.
I've found the answer: https://github.com/scylladb/scylladb/pull/13293
Yes, it was deliberate.
And @tzach, you got the benchmark you asked for back then :) It's bad, and the calculator from https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement doesn't help much when you have 5000+ tables.
We have `t3.large` for the monitor, which may not be exactly what the calculator suggests, but two years ago it worked OK for this case...
The test run used for the bug report had the following instance type for the monitoring node: `m6i.xlarge`
Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.
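
The same information shown on the TSDB status page is also exposed by the Prometheus HTTP API at `/api/v1/status/tsdb`, which can be easier to attach to an issue than a screenshot. The following is only a minimal sketch: the base URL `http://localhost:9090` is an assumption (use the address of the monitoring node), and the helper names are illustrative, not part of Scylla Monitoring.

```python
import json
import urllib.request

# Sections of the TSDB status document that hold "top N" lists.
TOP_SECTIONS = (
    "labelValueCountByLabelName",
    "seriesCountByMetricName",
    "memoryInBytesByLabelName",
    "seriesCountByLabelValuePair",
)


def fetch_tsdb_status(base_url="http://localhost:9090"):
    """Return the 'data' part of /api/v1/status/tsdb (base_url is an assumption)."""
    url = f"{base_url}/api/v1/status/tsdb"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["data"]


def top_entries(tsdb_data, section, limit=10):
    """Return up to `limit` (name, value) pairs from one section of the status data."""
    entries = tsdb_data.get(section, [])
    return [(entry["name"], entry["value"]) for entry in entries[:limit]]


def print_summary(tsdb_data, limit=10):
    """Print the head statistics and every top-N section as plain text."""
    head = tsdb_data.get("headStats", {})
    print("series:", head.get("numSeries"), "chunks:", head.get("chunkCount"))
    for section in TOP_SECTIONS:
        print(section)
        for name, value in top_entries(tsdb_data, section, limit):
            print(f"  {name} | {value}")
```

Calling `print_summary(fetch_tsdb_status())` on the monitoring node should print output comparable to the tables on the status page, provided Prometheus is reachable at the assumed address.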

Head Stats

Number of Series | Number of Chunks | Number of Label Pairs | Current Min Time | Current Max Time |
---|---|---|---|---|
2843270 | 15940315 | 12293 | 2024-12-01T06:00:00.714Z (1733032800714) | 2024-12-01T09:37:40.845Z (1733045860845) |

Top 10 label names with value count

Name | Count |
---|---|
cf | 10056 |
`__name__` | 1193 |
le | 143 |
type | 115 |
devices | 83 |
handler | 51 |
collector | 46 |
name | 35 |
cpu | 32 |
shard | 30 |

Top 10 series count by metric names

Name | Count |
---|---|
scylla_column_family_write_latency_bucket | 1366170 |
scylla_column_family_read_latency_bucket | 679835 |
wlatencyaks | 55646 |
wlatencyp95ks | 55646 |
wlatencyp99ks | 55646 |
scylla_column_family_cache_hit_rate | 50280 |
scylla_column_family_live_sstable | 50280 |
scylla_column_family_total_disk_space | 50280 |
scylla_column_family_live_disk_space | 50280 |
rlatencyp99ks | 27724 |
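
To put these counts in proportion, the arithmetic below relates the two per-table latency histograms to the total number of series in the head. It is a sketch using only the numbers from the tables in this comment; the variable names are illustrative, and the "per `cf` value" figure is just an average over distinct `cf` label values, not a statement about how Scylla emits the metrics.

```python
# Numbers copied from the TSDB status tables in this comment.
TOTAL_SERIES = 2_843_270      # "Number of Series" in the head
CF_LABEL_VALUES = 10_056      # distinct values of the `cf` label

series_by_metric = {
    "scylla_column_family_write_latency_bucket": 1_366_170,
    "scylla_column_family_read_latency_bucket": 679_835,
    "wlatencyaks": 55_646,
    "wlatencyp95ks": 55_646,
    "wlatencyp99ks": 55_646,
    "scylla_column_family_cache_hit_rate": 50_280,
    "scylla_column_family_live_sstable": 50_280,
    "scylla_column_family_total_disk_space": 50_280,
    "scylla_column_family_live_disk_space": 50_280,
    "rlatencyp99ks": 27_724,
}


def share(count, total=TOTAL_SERIES):
    """Percentage of all head series represented by `count`."""
    return 100.0 * count / total


# Combined weight of the two per-table latency histograms.
latency_series = (
    series_by_metric["scylla_column_family_write_latency_bucket"]
    + series_by_metric["scylla_column_family_read_latency_bucket"]
)
latency_share = share(latency_series)

# Average number of write-latency bucket series per distinct `cf` value.
write_buckets_per_cf = (
    series_by_metric["scylla_column_family_write_latency_bucket"] / CF_LABEL_VALUES
)

for name, count in sorted(series_by_metric.items(), key=lambda kv: -kv[1]):
    print(f"{name:45s} {count:>9d} {share(count):6.2f}%")
print(f"latency buckets combined: {latency_share:.2f}% of all series")
print(f"write latency buckets per cf value: {write_buckets_per_cf:.1f}")
```

With these inputs the two latency bucket metrics together make up roughly 72% of all series, which suggests the per-table histograms are the main driver of the series count here.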

Top 10 label names with high memory usage

Name | Bytes |
---|---|
`__name__` | 106236467 |
cf | 44598208 |
cluster | 28081620 |
instance | 26033983 |
le | 25698553 |
dc | 25301132 |
job | 15793954 |
ks | 13335682 |
by | 2367290 |
class | 339838 |

Top 10 series count by label value pairs

Name | Count |
---|---|
dc=eu-west-1 | 2810750 |
cluster=my-cluster | 2808162 |
ks=feeds | 2664829 |
job=scylla | 2556887 |
`__name__`=scylla_column_family_write_latency_bucket | 1366170 |
instance=10.4.4.193 | 853470 |
instance=10.4.6.77 | 853400 |
instance=10.4.4.64 | 853271 |
`__name__`=scylla_column_family_read_latency_bucket | 679835 |
instance=10.4.6.145 | 132761 |

Installation details
Panel Name: any
Dashboard Name: any
Scylla-Monitoring Version: 4.8.0
Scylla-Version: 2024.2.0~rc3-20241004.89f8638e9e9b
Monitor node instance type: m6i.xlarge

Running a test that creates tables in batches of 125, we observe constant growth in memory and CPU utilization:

The same applies to disk utilization:

Result of the `top` command:

DB nodes load:

The batches are visible in the DB nodes load screenshot: each `tooth` is the population of 125 tables.

Argus: scylla-staging/valerii/vp-scale-5000-tables-test#3
CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-scale-5000-tables-test/3