scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0

scylla_node: derive mem_mb_per_cpu from --smp and --memory options #443

Closed bhalevy closed 1 year ago

bhalevy commented 1 year ago

We'd like to use the values provided in the SCYLLA_EXT_OPTS environment variable as defaults, but also derive self._mem_mb_per_cpu from them.

Then, if a test passes --smp without --memory in jvm_args, we should calculate --memory as self._mem_mb_per_cpu * self._smp.

Otherwise, the default --memory parameter given in SCYLLA_EXT_OPTS could be too small if the test uses more shards than the default.
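
The derivation described above could look roughly like the sketch below. This is not the actual ccmlib code: the helper names (`parse_memory_mb`, `parse_ext_opts`, `memory_for_test`) and the 512 MB fallback are assumptions made for illustration, and only the `M` and `G` size suffixes are handled.

```python
import shlex


def parse_memory_mb(value):
    """Convert a size such as '1024M' or '2G' to whole megabytes."""
    suffix = value[-1].upper()
    if suffix == 'G':
        return int(value[:-1]) * 1024
    if suffix == 'M':
        return int(value[:-1])
    raise ValueError(f"unsupported memory size: {value!r}")


def parse_ext_opts(ext_opts):
    """Split a SCYLLA_EXT_OPTS-like string into {option: value or None}."""
    opts = {}
    tokens = shlex.split(ext_opts)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if '=' in tok:
            # '--smp=2' form
            key, val = tok.split('=', 1)
            opts[key] = val
        elif i + 1 < len(tokens) and not tokens[i + 1].startswith('--'):
            # '--smp 2' form: the next token is the value
            opts[tok] = tokens[i + 1]
            i += 1
        else:
            # flag without a value, e.g. '--overprovisioned'
            opts[tok] = None
        i += 1
    return opts


def memory_for_test(ext_opts, test_smp=None, test_memory_mb=None,
                    fallback_mem_mb_per_cpu=512):
    """Return (smp, memory_mb), scaling --memory with the shard count."""
    defaults = parse_ext_opts(ext_opts)
    default_smp = int(defaults.get('--smp') or 1)
    if defaults.get('--memory'):
        # Derive the per-CPU budget from the environment defaults.
        mem_mb_per_cpu = parse_memory_mb(defaults['--memory']) // default_smp
    else:
        mem_mb_per_cpu = fallback_mem_mb_per_cpu  # hypothetical fallback
    smp = test_smp if test_smp is not None else default_smp
    if test_memory_mb is not None:
        return smp, test_memory_mb  # the test's explicit --memory wins
    return smp, mem_mb_per_cpu * smp


# Defaults of 2 shards and 1024M give 512 MB per CPU, so smp=8 gets 4096 MB.
print(memory_for_test('--smp 2 --memory 1024M', test_smp=8))
```

With this scheme, a test that only raises the shard count keeps the same per-shard budget as the defaults instead of splitting a fixed total across more shards.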

See, for example, https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/216/artifact/logs-full.release.018/dtest-gw2.log: the test times out because bootstrapping new nodes with smp=8 and memory=1024M takes increasingly longer due to the immense memory pressure.

06:25:11,552 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node1: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node1/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node1/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.1', '--collectd-hostname', '315a1f84ff0e.node1', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.1', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:25:36,817 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node1: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node1/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.1', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.1', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.1', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node1/bin/scylla-jmx-1.0.jar']
06:25:37,228 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node2: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node2/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node2/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.2', '--collectd-hostname', '315a1f84ff0e.node2', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.2', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:27:42,292 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node2: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.2', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node2/bin/scylla-jmx-1.0.jar']
06:27:42,654 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node3: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node3/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node3/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.3', '--collectd-hostname', '315a1f84ff0e.node3', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.3', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:30:16,851 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node3: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node3/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.3', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.3', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.3', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node3/bin/scylla-jmx-1.0.jar']
06:30:17,220 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node4: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node4/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node4/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.4', '--collectd-hostname', '315a1f84ff0e.node4', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.4', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:32:54,515 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node4: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node4/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.4', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.4', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.4', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node4/bin/scylla-jmx-1.0.jar']
06:32:55,036 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node5: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node5/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node5/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.5', '--collectd-hostname', '315a1f84ff0e.node5', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:39:47,852 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node5: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node5/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.5', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.5', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.5', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node5/bin/scylla-jmx-1.0.jar']
06:39:49,106 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node6: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node6/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node6/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.6', '--collectd-hostname', '315a1f84ff0e.node6', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.6', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
06:53:51,523 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node6: Starting scylla-jmx: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node6/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.29.6', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.29.6', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.29.6', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node6/bin/scylla-jmx-1.0.jar']
06:53:52,710 740     ccm                            DEBUG    cluster.py          :711  | test_lwt_load: node7: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node7/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node7/conf/scylla.yaml', '--log-to-stdout', '1', '--smp', '8', '--api-address', '127.0.29.7', '--collectd-hostname', '315a1f84ff0e.node7', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', '0', '--overprovisioned', '--prometheus-address', '127.0.29.7', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True
07:10:11,740 740     errors                         ERROR    conftest.py         :203  | test_lwt_load: test failed:
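
To make the slowdown visible, the timestamps in the log above can be turned into per-node startup times. The sketch below assumes that the gap between a node's "Starting scylla" line and its "Starting scylla-jmx" line approximates how long that node took to come up; the timestamps are copied from the log, everything else is illustrative.

```python
from datetime import datetime

# (node, "Starting scylla" time, "Starting scylla-jmx" time), from the log above
STARTS = [
    ("node1", "06:25:11,552", "06:25:36,817"),
    ("node2", "06:25:37,228", "06:27:42,292"),
    ("node3", "06:27:42,654", "06:30:16,851"),
    ("node4", "06:30:17,220", "06:32:54,515"),
    ("node5", "06:32:55,036", "06:39:47,852"),
    ("node6", "06:39:49,106", "06:53:51,523"),
]


def seconds_between(start, end):
    """Elapsed seconds between two 'HH:MM:SS,mmm' timestamps on the same day."""
    fmt = "%H:%M:%S,%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Round to whole seconds; millisecond precision is not meaningful here.
durations = {node: round(seconds_between(s, e)) for node, s, e in STARTS}
for node, secs in durations.items():
    print(f"{node}: {secs} s")
```

Under that assumption the startup time grows from about 25 seconds for node1 to about 14 minutes for node6, and node7 had not finished starting when the test failed at 07:10:11.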

In https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/216/artifact/logs-full.release.018/1678691413194_lwt_schema_modification_test.py%3A%3ATestLWTSchemaModification%3A%3Atest_lwt_load/node1.log there are many indications of memory pressure, in the form of:

INFO  2023-03-13 06:40:05,013 [shard 0] storage_service - entering BOOTSTRAP mode
INFO  2023-03-13 06:40:05,013 [shard 0] storage_service - Wait until local node knows tokens of peer nodes
INFO  2023-03-13 06:40:05,013 [shard 0] gossip - Waiting for pending range setup...
INFO  2023-03-13 06:40:05,113 [shard 2] compaction - [Compact system.local e3744c50-c169-11ed-a1e7-0397f3172e56] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-d2zbtz9j/test/node6/data/system/local-7ad54392bcdd35a684174e047860b377/me-58-big-Data.db:level=0]. 86kB to 45kB (~52% of original) in 78ms = 577kB/s. ~256 total partitions merged to 1.
WARN  2023-03-13 06:40:07,682 [shard 7] lsa-timing - reclaim took 29000 us, trying to release 0.076 MiB preemptibly, reserve: {goal: 1, max: 30}, at 0x5618fee 0x56195a0 0x5619888 0x1db463d 0x1db282c 0x1dd1ef2 0x1968046 0x528b704 0x528c987 0x52adbb1 0x525ee3a /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x8b12c /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x10cbbf
   --------
   seastar::coroutine::internal::maybe_yield_awaiter
...
WARN  2023-03-13 06:40:47,379 [shard 0] lsa-timing - reclaim took 28000 us, trying to release 8.127 MiB preemptibly, reserve: {goal: 1, max: 30}, at 0x5618fee 0x56195a0 0x5619888 0x1db463d 0x1db282c 0x1dd1ef2 0x1968046 0x528b704 0x528c987 0x528bcc9 0x5230eb5 0x5230028 0x11b3be4 0x11b56f0 0x11b2115 /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x2750f /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x275c8 0x11afc64
   --------
   seastar::coroutine::internal::maybe_yield_awaiter
WARN  2023-03-13 06:40:47,611 [shard 0] lsa-timing - reclaim took 43000 us, trying to release 8.127 MiB preemptibly, reserve: {goal: 1, max: 30}, at 0x5618fee 0x56195a0 0x5619888 0x1db463d 0x1db282c 0x1dd1ef2 0x1968046 0x528b704 0x528c987 0x528bcc9 0x5230eb5 0x5230028 0x11b3be4 0x11b56f0 0x11b2115 /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x2750f /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x275c8 0x11afc64
   --------
   seastar::coroutine::internal::maybe_yield_awaiter
WARN  2023-03-13 06:40:47,715 [shard 0] lsa-timing - reclaim took 69000 us, trying to release 8.127 MiB preemptibly, reserve: {goal: 1, max: 30}, at 0x5618fee 0x56195a0 0x5619888 0x1db463d 0x1db282c 0x1dd1ef2 0x1968046 0x528b704 0x528c987 0x528bcc9 0x5230eb5 0x5230028 0x11b3be4 0x11b56f0 0x11b2115 /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x2750f /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.ccm/scylla-repository/1379d8330f1966fec818cfe5567544091bbfdfd2/libreloc/libc.so.6+0x275c8 0x11afc64
   --------
   seastar::coroutine::internal::maybe_yield_awaiter

This apparently started happening after scylladb/scylladb@020483aa594c0978bd3696c0d2b316fb77db5b2e, which changed:

diff --git a/main.cc b/main.cc
index bdb7730853..2d1781e2ef 100644
--- a/main.cc
+++ b/main.cc
@@ -476,7 +476,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
     // We need to have the entire app config to run the app, but we need to
     // run the app to read the config file with UDF specific options so that
     // we know whether we need to reserve additional memory for UDFs.
-    app_cfg.reserve_additional_memory = db::config::wasm_udf_reserved_memory;
+    app_cfg.reserve_additional_memory_per_shard = db::config::wasm_udf_reserved_memory;
     app_template app(std::move(app_cfg));

     auto ext = std::make_shared<db::extensions>();

That change applies the wasm reservation per shard rather than once per process, increasing the overall memory reservation for wasm and bringing scylla to its knees with 1024M total for 8 shards.
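
A rough back-of-the-envelope sketch of the effect follows. It assumes the reservation is deducted from the `--memory` budget, and `RESERVED_MB` is a placeholder for illustration, not the actual value of `db::config::wasm_udf_reserved_memory`; the function name is hypothetical as well.

```python
def per_shard_budget_mb(total_mb, smp, reserved_mb, per_shard):
    """Memory left to each shard after the reservation, in MB."""
    share = total_mb / smp
    # A per-process reservation is spread over all shards;
    # a per-shard reservation is paid in full by every shard.
    reserved = reserved_mb if per_shard else reserved_mb / smp
    return share - reserved


RESERVED_MB = 50  # placeholder value, for illustration only

before = per_shard_budget_mb(1024, 8, RESERVED_MB, per_shard=False)
after = per_shard_budget_mb(1024, 8, RESERVED_MB, per_shard=True)
print(f"per-process reservation: {before} MB per shard")
print(f"per-shard reservation:   {after} MB per shard")
```

With 1024M over 8 shards each shard starts from only 128 MB, so any reservation that is multiplied by the shard count takes a large bite out of an already small budget.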

bhalevy commented 1 year ago

This PR replaces #437

$ git diff c4a5215 bef253b
diff --git a/ccmlib/scylla_node.py b/ccmlib/scylla_node.py
index 9fd540d..bf38ff1 100644
--- a/ccmlib/scylla_node.py
+++ b/ccmlib/scylla_node.py
@@ -590,7 +590,8 @@ class ScyllaNode(Node):
                     self._smp = int(v)
             elif k != '--memory':
                 args.append(k)
-                args.append(v)
+                if v:
+                    args.append(v)

         args.extend(translated_args)
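
The effect of the `if v:` guard in the diff above can be illustrated in isolation. This is a simplified sketch, not the real method from ccmlib/scylla_node.py: `build_args` is a hypothetical helper, and it assumes value-less flags are stored with an empty or None value.

```python
def build_args(options):
    """Flatten {option: value} into an argv-style list, skipping empty values."""
    args = []
    for k, v in options.items():
        args.append(k)
        if v:
            # Only append a value when there is one; flags such as
            # '--overprovisioned' take no argument.
            args.append(v)
    return args


print(build_args({'--overprovisioned': None, '--developer-mode': 'true'}))
```

Without the guard, a flag with no value would be followed by an empty or None entry in the argument list.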

fruch commented 1 year ago

Taking it for a ride with a bigger bunch of tests: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/281/

fruch commented 1 year ago

All tests ran OK (some did fail, but not related to this, as far as I can see). I think we should wait for the scylla-pkg fix, so it gets a proper next run.

bhalevy commented 1 year ago

yes. thanks!

fruch commented 1 year ago

@bhalevy seems like there's a need to rebase this one

bhalevy commented 1 year ago

rebased