Nodes are available:
gemini-basic-3h-Add-gemi-oracle-db-node-27c2f9d1-1 34.243.104.238 running aws eu-west-1
gemini-basic-3h-Add-gemi-db-node-27c2f9d1-1 52.31.170.158 running aws eu-west-1
gemini-basic-3h-Add-gemi-db-node-27c2f9d1-3 34.245.91.103 running aws eu-west-1
gemini-basic-3h-Add-gemi-db-node-27c2f9d1-2 34.245.88.176 running aws eu-west-1
gemini-basic-3h-Add-gemi-monitor-node-27c2f9d1-1 63.35.237.223 running aws eu-west-1
@aleksbykov is this a regression?
does this somehow relate to #7853 ?
Not sure that it is a regression, but I didn't see such behavior previously when gemini generated MV/SI.
Each gemini run generates a new random seed. I will check whether we have runs with the same seed and what their status was. Issue #7853 uses the same gemini command and the same seed (the schemas are the same), and the behavior looks the same: the MV backpressure metrics are very high.
The issue reproduced with another job: longevity-mv-si-4days-test. The same errors were found during and after the RebuildStreamingErr nemesis:
2021-02-03T11:56:14+00:00 longevity-mv-si-4d-4-4-db-node-398786c2-1 !ERR | scylla: [shard 9] storage_proxy - Exception when communicating with 10.0.2.94, to read from sec_index.users: std::runtime_error (_read_concurrency_sem: restricted mutation reader queue overload)
Monitoring: http://18.195.49.248:3000/d/x7IlS1LMz/longevity-mv-si-4days-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1612350036518&to=1612354538589
Scylla version: 4.4.rc1-0.20210131.fd868722d with build-id 81a1e1af8194f89648f0de52d16221ac170c539a
without ENOSPACE
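For context on the error itself: the `restricted mutation reader queue overload` message comes from the reader concurrency semaphore, which limits how many reads a shard runs at once and how many may wait behind them; once both the active budget and the wait queue are full, new reads are rejected rather than queued. A minimal illustrative sketch of that admission pattern (this is not Scylla's implementation; the class, names, and limits are made up for illustration):

```python
import asyncio


# Illustrative sketch only (not Scylla code): a per-shard reader admission
# gate with a bounded wait queue. When the active-reader budget is used up
# AND the wait queue is already full, a new read is rejected immediately,
# which is the kind of "queue overload" error seen in the log above.
class ReaderConcurrencySemaphore:
    def __init__(self, max_active: int, max_queued: int):
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = 0
        self.waiters: list[asyncio.Future] = []

    async def admit(self) -> None:
        if self.active < self.max_active:
            self.active += 1
            return
        if len(self.waiters) >= self.max_queued:
            raise RuntimeError("restricted reader queue overload")
        fut = asyncio.get_running_loop().create_future()
        self.waiters.append(fut)
        await fut            # woken up by release() of a finished read
        self.active += 1

    def release(self) -> None:
        self.active -= 1
        if self.waiters:
            self.waiters.pop(0).set_result(None)
```

The point of the sketch: once a node is slow enough that reads stop completing, the wait queue fills up and every further read fails fast with this error, which would explain the burst of identical messages in the log.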
Some basic analysis from the metrics:
It looks like the node that errors out accumulates a lot of background reads. For some reason there is a spike in the cache misses for the same node around the same time:
And this causes thousands of queued sstable reads:
It looks like we are under some kind of overload, but it is unclear why we don't get out of it. @aleksbykov is there any way to run this with the same seed? I would like to see if this is reproducible. In the meantime I will look at gemini's logs to see if I can figure out how that could have happened (maybe the queries and the rate will tell us something interesting).
@psarna / @nyh is it possible that the reads are in the background because those are MV reads? I see that the view backlog is at a steady 0, but maybe something is wrong with the counter? Do we have a metric that counts background reads for view updates separately from "normal" background reads?
Timeouts are correlated with background reads, pretty much 1:1. Once a read request times out on the coordinator it's automatically categorized as a background read, because the coordinator responds to the client, but also still waits for a replica to respond. So perhaps a replica overload?
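A toy model of the accounting described above, assuming only what the comment says: on a client-side timeout the coordinator answers the client but keeps waiting for the replica, and that still-pending read is what the background-reads metric counts (names here are illustrative, not Scylla APIs):

```python
import asyncio

background_reads: set = set()   # stand-in for the "background reads" metric


async def coordinator_read(replica_read, timeout: float):
    """Answer the client within `timeout`; if the replica is slower, keep the
    replica read running and count it as a background read from then on."""
    task = asyncio.ensure_future(replica_read())
    try:
        # shield() so the timeout cancels our wait, not the replica read itself
        return await asyncio.wait_for(asyncio.shield(task), timeout)
    except asyncio.TimeoutError:
        background_reads.add(task)                 # 1:1 with client timeouts
        task.add_done_callback(background_reads.discard)
        raise                                      # the client sees a timeout
```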
Doesn't sound related to views at all. Materialized views are a cause of background writes, but the read path is literally the same as a regular table (well, except for read-before-write, but that doesn't happen in the background).
(well, except for read-before-write, but that doesn't happen in the background).
Just to see that I understand correctly, when a table with views is written to, we don't come back to the client before we have prepared the view updates/ view update mutations to be sent?
@eliransin I can run a job to reproduce with the same gemini seed.
(well, except for read-before-write, but that doesn't happen in the background).
Just to see that I understand correctly, when a table with views is written to, we don't come back to the client before we have prepared the view updates/ view update mutations to be sent?
I believe so - if I remember correctly, we even send the updates before applying the actual local write, and only then return. We just don't wait for the updates to finish (unless the updates are also local, in which case we do).
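So, paraphrasing the two comments above, the ordering per base write would be roughly: build the view updates (this is where read-before-write happens, in the foreground), send the remote view updates without waiting for them, apply the base write locally, wait only for view updates that are local, then return to the client. A rough sketch of that ordering as described here (function names are placeholders, not Scylla's actual code path):

```python
import asyncio


async def write_with_views(mutation, build_view_updates, send_remote, apply_local):
    """Ordering as described in the discussion above (illustrative only)."""
    # 1. Read-before-write + view update generation, in the foreground.
    view_updates = await build_view_updates(mutation)

    remote = [u for u in view_updates if not u.is_local]
    local = [u for u in view_updates if u.is_local]

    # 2. Send remote view updates, but do not wait for them to finish:
    #    they continue as background writes.
    for u in remote:
        asyncio.ensure_future(send_remote(u))

    # 3. Apply the base write locally.
    await apply_local(mutation)

    # 4. Local view updates are the exception: we do wait for them.
    await asyncio.gather(*(apply_local(u) for u in local))

    # 5. Only now return to the client.
    return "ack"
```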
@aleksbykov please do, if this reproduces it can be yet another overload scenario :slightly_smiling_face: While you do this I will keep digging into what we already have maybe something will pop up.
The issue reproduced with another job: longevity-mv-si-4days-test. The same errors were found during and after the RebuildStreamingErr nemesis:
2021-02-03T11:56:14+00:00 longevity-mv-si-4d-4-4-db-node-398786c2-1 !ERR | scylla: [shard 9] storage_proxy - Exception when communicating with 10.0.2.94, to read from sec_index.users: std::runtime_error (_read_concurrency_sem: restricted mutation reader queue overload)
Monitoring: http://18.195.49.248:3000/d/x7IlS1LMz/longevity-mv-si-4days-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1612350036518&to=1612354538589
Scylla version: 4.4.rc1-0.20210131.fd868722d with build-id 81a1e1af8194f89648f0de52d16221ac170c539a
We still see this issue in https://jenkins.scylladb.com/job/scylla-4.4/job/longevity/job/longevity-mv-si-4days-test/3
@eliransin - issue reproduced
@psarna please have a look as it might be connected to overload. @amoskong please keep the cluster alive for us at least for tomorrow and then @psarna will tell us if he needs to extract something from it.
Issue reproduced also in 2021.1.rc2-0.20210502.f3e5e1f12.
Also in job longevity-mv-si-4d
during RebuildStreamingErr
Test-id: 53d9ce43-5ff6-488f-94ab-d764ad1dcee0
Issue reproduced:
Installation details
Scylla version (or git commit hash): 4.4.4-0.20210801.69daa9fd0 with build-id fdea767545b17ae9adf1263dbd464c26712e698e
Cluster size: 3 nodes (i3.large)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-01fc47b31caf71165
(aws: eu-north-1)
Gemini command:
/$HOME/gemini -d --duration 3h --warmup 30m -c 100 -m mixed -f --non-interactive --cql-features normal --max-mutation-retries 5 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '1'}" --test-cluster=10.0.0.91,10.0.2.37,10.0.0.141 --outfile /home/centos/gemini_result_7d6290e0-b2aa-4f56-9341-d8433e77bfc4.log --seed 8 --oracle-cluster=10.0.1.9
Gemini version: 1.7.5
Test: gemini-3h-with-nemesis-test
Test name: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 075b3773-17dd-49c2-b684-dd4b39b3552e
Show all stored logs command: $ hydra investigate show-logs 075b3773-17dd-49c2-b684-dd4b39b3552e
Test id: 075b3773-17dd-49c2-b684-dd4b39b3552e
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_193208/grafana-screenshot-gemini-3h-with-nemesis-test-scylla-per-server-metrics-nemesis-20210803_193548-gemini-with-nemesis-3h-normal-4-4-monitor-node-075b3773-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_193208/grafana-screenshot-overview-20210803_193208-gemini-with-nemesis-3h-normal-4-4-monitor-node-075b3773-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_194025/db-cluster-075b3773.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_194025/loader-set-075b3773.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_194025/monitor-set-075b3773.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/075b3773-17dd-49c2-b684-dd4b39b3552e/20210803_194025/sct-runner-075b3773.zip
Gemini command:
gemini -d --duration 3h --warmup 30m -c 100 -m mixed -f --non-interactive --cql-features normal --max-mutation-retries 5 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '1'}" --test-cluster=10.0.0.91,10.0.2.37,10.0.0.141 --outfile /home/centos/gemini_result_7d6290e0-b2aa-4f56-9341-d8433e77bfc4.log --seed 8 --oracle-cluster=10.0.1.9
Used schema:
Schema: {
"keyspace": {
"name": "ks1",
"replication": {
"class": "SimpleStrategy",
"replication_factor": "3"
},
"oracle_replication": {
"class": "SimpleStrategy",
"replication_factor": "1"
}
},
"tables": [
{
"name": "table1",
"partition_keys": [
{
"name": "pk0",
"type": "int"
},
{
"name": "pk1",
"type": "varint"
},
{
"name": "pk2",
"type": "int"
}
],
"clustering_keys": [
{
"name": "ck0",
"type": "blob"
},
{
"name": "ck1",
"type": "double"
}
],
"columns": [
{
"name": "col0",
"type": "tinyint"
},
{
"name": "col1",
"type": {
"types": {
"udt_1677083946_0": "blob",
"udt_1677083946_1": "tinyint",
"udt_1677083946_2": "blob",
"udt_1677083946_3": "uuid",
"udt_1677083946_4": "timestamp"
},
"type_name": "udt_1677083946",
"frozen": true
}
},
{
"name": "col2",
"type": "boolean"
},
{
"name": "col3",
"type": {
"kind": "list",
"type": "bigint",
"frozen": false
}
},
{
"name": "col4",
"type": {
"kind": "list",
"type": "varchar",
"frozen": true
}
},
{
"name": "col5",
"type": {
"kind": "list",
"type": "boolean",
"frozen": false
}
},
{
"name": "col6",
"type": {
"types": {
"udt_1381445421_0": "tinyint",
"udt_1381445421_1": "decimal",
"udt_1381445421_2": "double"
},
"type_name": "udt_1381445421",
"frozen": true
}
},
{
"name": "col7",
"type": "date"
}
],
"indexes": [
{
"name": "table1_col0_idx",
"column": "col0",
"column_idx": 0
}
],
"materialized_views": [
{
"name": "table1_mv_0",
"partition_keys": [
{
"name": "col0",
"type": "tinyint"
},
{
"name": "pk0",
"type": "int"
},
{
"name": "pk1",
"type": "varint"
},
{
"name": "pk2",
"type": "int"
}
],
"clustering_keys": [
{
"name": "ck0",
"type": "blob"
},
{
"name": "ck1",
"type": "double"
}
]
}
],
I suspect that the problem in this workload is the same issue as #8873:
- Writes with CL < RF (e.g., ONE or QUORUM - @aleksbykov can you please confirm that this is the case?)
CL = QUORUM, RF = 3
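To spell out why CL < RF matters here: with RF=3 and CL=QUORUM the coordinator acknowledges the client after two replica responses, so the write to the third replica finishes in the background, and a consistently slow replica accumulates that background work. A toy illustration (not Scylla code; the replica write functions are placeholders):

```python
import asyncio


async def quorum_write(replica_writes, cl: int = 2):
    """Ack after `cl` of the replica writes complete (e.g. QUORUM=2 of RF=3);
    whatever has not finished yet keeps running as a background write."""
    tasks = [asyncio.ensure_future(w()) for w in replica_writes]
    acked = 0
    for finished in asyncio.as_completed(tasks):
        await finished
        acked += 1
        if acked >= cl:
            break
    # The client is answered here; the remaining replica writes are now
    # "background writes" that the coordinator does not wait for.
    return [t for t in tasks if not t.done()]
```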
Got a repro on master (4.6.dev), but note that the concurrency was reduced from 100 to 50:
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20210728.42e1f318d with build-id d23dd7c84a396ba1cdf9a93e82a05cb944e5c234
Cluster size: 3 nodes (i3.large)
Scylla running with shards number (live nodes):
gemini-with-nemesis-3h-normal-maste-db-node-51f8a9a0-1 (13.51.171.116 | 10.0.0.243): 2 shards
gemini-with-nemesis-3h-normal-maste-db-node-51f8a9a0-3 (13.48.55.63 | 10.0.0.76): 2 shards
gemini-with-nemesis-3h-normal-maste-db-node-51f8a9a0-4 (13.49.73.79 | 10.0.3.32): 2 shards
Scylla running with shards number (terminated nodes):
gemini-with-nemesis-3h-normal-maste-db-node-51f8a9a0-2 (13.51.233.236 | 10.0.1.17): 2 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0a67e737c28ecaa63
(aws: eu-north-1)
Gemini command:
/$HOME/gemini -d --duration 3h --warmup 30m -c 50 -m mixed -f --non-interactive --cql-features normal --max-mutation-retries 5 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '1'}" --test-cluster=10.0.0.243,10.0.1.17,10.0.0.76 --outfile /home/centos/gemini_result_3d37828f-639e-4ab3-856f-d79fcfeefc70.log --seed 12 --oracle-cluster=10.0.1.36
Gemini version: 1.7.5
Test: gemini-3h-with-nemesis-test
Test name: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 51f8a9a0-7f5c-4afb-bc22-215c105f8649
Show all stored logs command: $ hydra investigate show-logs 51f8a9a0-7f5c-4afb-bc22-215c105f8649
Test id: 51f8a9a0-7f5c-4afb-bc22-215c105f8649
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034411/grafana-screenshot-gemini-3h-with-nemesis-test-scylla-per-server-metrics-nemesis-20210729_034525-gemini-with-nemesis-3h-normal-maste-monitor-node-51f8a9a0-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034411/grafana-screenshot-overview-20210729_034412-gemini-with-nemesis-3h-normal-maste-monitor-node-51f8a9a0-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034655/db-cluster-51f8a9a0.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034655/loader-set-51f8a9a0.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034655/monitor-set-51f8a9a0.tar.gz
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/51f8a9a0-7f5c-4afb-bc22-215c105f8649/20210729_034655/sct-runner-51f8a9a0.tar.gz
Closing - we have enough similar issues - and I hope now we'll attend to them!
Installation details
Scylla version (or git commit hash): 4.3.rc3-0.20201223.5bd52e4db with build-id 7855b9d820223e7134e84437248b1b474a492f64
Cluster size: 3 nodes (i3.large)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-01ef5abee0187f85a
(aws: eu-west-1)
Live monitor: http://63.35.237.223:3000/d/Kd83nlbGk/byo-abykov-staging-scylla-per-server-metrics-nemesis-master?orgId=1&from=1609250311180&to=1609253995687&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All&var-sct_tags=DisruptionEvent
Gemini command:
Gemini version: 1.7.3
Test: BYO-abykov-staging
Test name: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):
Issue description
Simple gemini test for 3h without any nemesis. Gemini generated the following schema:
The schema on the cluster:
After gemini had been running for 32 minutes, node 2 started to report the following errors:
It seems that this error terminated gemini with the following messages:
Restore Monitor Stack command: $ hydra investigate show-monitor 27c2f9d1-c71a-4585-9037-9a9022a1188f
Show all stored logs command: $ hydra investigate show-logs 27c2f9d1-c71a-4585-9037-9a9022a1188f
Test id: 27c2f9d1-c71a-4585-9037-9a9022a1188f
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/27c2f9d1-c71a-4585-9037-9a9022a1188f/20201229_145628/db-cluster-27c2f9d1.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/27c2f9d1-c71a-4585-9037-9a9022a1188f/20201229_145628/loader-set-27c2f9d1.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/27c2f9d1-c71a-4585-9037-9a9022a1188f/20201229_145628/monitor-set-27c2f9d1.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/27c2f9d1-c71a-4585-9037-9a9022a1188f/20201229_145628/sct-runner-27c2f9d1.zip
Jenkins job URL