vponomaryov commented 22 hours ago

Installation details
Panel Name: any
Dashboard Name: any
Scylla-Monitoring Version: 4.8.0
Scylla-Version: 2024.2.0~rc3-20241004.89f8638e9e9b
Monitor node instance type: m6i.xlarge

Running a test that creates tables in batches of 125, we observe constant growth in memory and CPU utilization:

The same applies to disk utilization:

Output of the top command on the monitoring node:

Tasks: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.6 us,  0.2 sy,  0.0 ni, 74.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15717.2 total,   1244.2 free,  12641.1 used,   1831.9 buff/cache
MiB Swap:  20480.0 total,  16750.0 free,   3730.0 used.   2393.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                
   5527 ubuntu    20   0  113.7g  12.1g 570644 S 100.3  79.1   8266:01 prometheus                                                                                                                                                                                             
   9710 scylla    20   0   16.0t  76860  20480 S   1.0   0.5  92:07.13 scylla                                                                                                                                                                                                 
    414 root      20   0 1949744  17860   8192 S   0.3   0.1   4:16.46 containerd                                                                                                                                                                                             
   2977 root      20   0 2134828  32928  14080 S   0.3   0.2   3:13.23 dockerd                                                                                                                                                                                                
   5508 root      20   0 1238716   6408   3456 S   0.3   0.0   1:22.68 containerd-shim                                                                                                                                                                                        
   9718 scylla-+  20   0 1266796  25560  11904 S   0.3   0.2   4:57.53 scylla-manager                                                                                                                                                                                         
  57000 root      20   0 1319948  24704  16768 S   0.3   0.2   0:00.04 snapd                                                                                                                                                                                                  
      1 root      20   0  167584   6480   4048 S   0.0   0.0   0:23.68 systemd                                                                                                                                                                                                
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.06 kthreadd                                                                                                                                                                                               
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp

DB nodes load:

The batches are visible on the DB nodes load screenshot: each tooth corresponds to the population of one batch of 125 tables.

Argus: scylla-staging/valerii/vp-scale-5000-tables-test#3 CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-scale-5000-tables-test/3

vponomaryov commented 22 hours ago

The monitoring node is still alive: https://eu-west-1.console.aws.amazon.com/ec2/home?region=eu-west-1#InstanceDetails:instanceId=i-022788c782a7a759c

@fruch, @roydahan the above is the same problem as observed here: enterprise-2024.2/reproducers/scale-5000-tables-test#3

tzach commented 21 hours ago

@vponomaryov does the monitoring server meet the memory space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement

fruch commented 21 hours ago

@amnonh

I've found that we have some table-specific metrics, such as scylla_column_family_memtable_row_hits. After a quick tour of the Scylla code, I found the flag that enables them: https://github.com/scylladb/scylladb/blob/acd643bd75468703150b2e23b1bbf05a3e95e42d/db/config.cc#L1012

It's on by default.

Is that on purpose?

mykaul commented 21 hours ago

> @amnonh
>
> I've found that we have some table-specific metrics, such as scylla_column_family_memtable_row_hits. After a quick tour of the Scylla code, I found the flag that enables them: https://github.com/scylladb/scylladb/blob/acd643bd75468703150b2e23b1bbf05a3e95e42d/db/config.cc#L1012
>
> It's on by default.
>
> Is that on purpose?

Yes.

fruch commented 21 hours ago

I've found the answer: https://github.com/scylladb/scylladb/pull/13293

Yes, it was deliberate.

And @tzach, you got the benchmark you asked for back then :) It's bad, and the calculator from https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement doesn't help much when you have 5000+ tables.

We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it worked OK for this case...

vponomaryov commented 21 hours ago

> @vponomaryov does the monitoring server meet the memory space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement

> We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it worked OK for this case...

The test run behind this bug report used the following instance type for the monitoring node: m6i.xlarge

mykaul commented 5 hours ago

> @vponomaryov does the monitoring server meet the memory space requirement? https://monitoring.docs.scylladb.com/stable/install/monitoring-stack.html#calculating-prometheus-minimal-memory-space-requirement

> We have a t3.large for the monitor, which may not be exactly what the calculator suggests, but two years ago it worked OK for this case...

> The test run behind this bug report used the following instance type for the monitoring node: m6i.xlarge

Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.
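The same data shown on the TSDB status page can also be pulled from the Prometheus HTTP API endpoint /api/v1/status/tsdb. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (an assumption, adjust it to the monitoring node's address); the helper names fetch_tsdb_status and top_metrics are illustrative and not part of any Scylla tooling:

```python
import json
import urllib.request


def fetch_tsdb_status(base_url, timeout=10.0):
    """Return the 'data' object of Prometheus' /api/v1/status/tsdb response."""
    url = f"{base_url.rstrip('/')}/api/v1/status/tsdb"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"unexpected response status: {payload.get('status')}")
    return payload["data"]


def top_metrics(data, limit=10):
    """Return (metric name, series count) pairs, largest first."""
    entries = data.get("seriesCountByMetricName", [])
    pairs = [(entry["name"], int(entry["value"])) for entry in entries]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


if __name__ == "__main__":
    # Assumed address; replace with the real Prometheus endpoint.
    status = fetch_tsdb_status("http://localhost:9090")
    print("series in head:", status["headStats"]["numSeries"])
    for name, count in top_metrics(status):
        print(f"{count:>10}  {name}")
```

Keeping the parsing in top_metrics separate from the HTTP call makes it possible to run it against a saved JSON dump of the status as well.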

vponomaryov commented 4 hours ago

> Please fetch the TSDB status page from the Prometheus UI; it will help us analyze this.

TSDB Status

Head Stats

Number of Series	Number of Chunks	Number of Label Pairs	Current Min Time	Current Max Time
2843270	15940315	12293	2024-12-01T06:00:00.714Z (1733032800714)	2024-12-01T09:37:40.845Z (1733045860845)

Head Cardinality Stats

Top 10 label names with value count

Name	Count
cf	10056
__name__	1193
le	143
type	115
devices	83
handler	51
collector	46
name	35
cpu	32
shard	30

Top 10 series count by metric names

Name	Count
scylla_column_family_write_latency_bucket	1366170
scylla_column_family_read_latency_bucket	679835
wlatencyaks	55646
wlatencyp95ks	55646
wlatencyp99ks	55646
scylla_column_family_cache_hit_rate	50280
scylla_column_family_live_sstable	50280
scylla_column_family_total_disk_space	50280
scylla_column_family_live_disk_space	50280
rlatencyp99ks	27724

Top 10 label names with high memory usage

Name	Bytes
__name__	106236467
cf	44598208
cluster	28081620
instance	26033983
le	25698553
dc	25301132
job	15793954
ks	13335682
by	2367290
class	339838

Top 10 series count by label value pairs

Name	Count
dc=eu-west-1	2810750
cluster=my-cluster	2808162
ks=feeds	2664829
job=scylla	2556887
__name__=scylla_column_family_write_latency_bucket	1366170
instance=10.4.4.193	853470
instance=10.4.6.77	853400
instance=10.4.4.64	853271
__name__=scylla_column_family_read_latency_bucket	679835
instance=10.4.6.145	132761
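To put the figures above in proportion, here is a minimal sketch that uses only the numbers listed in this TSDB status. It computes how much of the head the per-table latency histograms account for, plus a few averages. The per-cf averages assume that every series of those metrics carries a cf label, which the status page does not confirm; the function name summarize is illustrative.

```python
# Figures copied from the TSDB status above.
TOTAL_SERIES = 2_843_270
TOTAL_CHUNKS = 15_940_315
CF_LABEL_VALUES = 10_056  # distinct values of the "cf" label

TOP_SERIES_BY_METRIC = {
    "scylla_column_family_write_latency_bucket": 1_366_170,
    "scylla_column_family_read_latency_bucket": 679_835,
    "wlatencyaks": 55_646,
    "wlatencyp95ks": 55_646,
    "wlatencyp99ks": 55_646,
    "scylla_column_family_cache_hit_rate": 50_280,
    "scylla_column_family_live_sstable": 50_280,
    "scylla_column_family_total_disk_space": 50_280,
    "scylla_column_family_live_disk_space": 50_280,
    "rlatencyp99ks": 27_724,
}


def summarize():
    """Derive shares and averages from the TSDB status figures."""
    histogram_series = sum(
        count
        for name, count in TOP_SERIES_BY_METRIC.items()
        if name.endswith("_latency_bucket")
    )
    top10_series = sum(TOP_SERIES_BY_METRIC.values())
    write_buckets = TOP_SERIES_BY_METRIC["scylla_column_family_write_latency_bucket"]
    gauge = TOP_SERIES_BY_METRIC["scylla_column_family_cache_hit_rate"]
    return {
        # Share of all head series held by the two latency histograms.
        "histogram_share_pct": round(100 * histogram_series / TOTAL_SERIES, 1),
        # Share of all head series held by the ten largest metrics.
        "top10_share_pct": round(100 * top10_series / TOTAL_SERIES, 1),
        # Average series per distinct cf value (assumes every series has a cf label).
        "gauge_series_per_cf": gauge / CF_LABEL_VALUES,
        "write_bucket_series_per_cf": round(write_buckets / CF_LABEL_VALUES, 1),
        "chunks_per_series": round(TOTAL_CHUNKS / TOTAL_SERIES, 1),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

With these inputs, the two scylla_column_family_*_latency_bucket metrics come to roughly 72% of all head series, so the per-table histograms are the dominant contributor to the series count.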

scylladb / scylla-monitoring

Monitoring node runs out of RAM and CPU resources with growth of the tables number and data in it #2429