Open denesb opened 1 year ago
@fruch ?
history of this test seem like it was failing with this for a while, and not it's failing with a different error:
> raise AssertionError('Critical errors found: {}\nOther errors: {}'.format( critical_errors, found_errors)) E AssertionError: Critical errors found: [('node1', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.']), ('node2', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.']), ('node3', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.'])] E Other errors: []
@bhalevy are you aware of those type of failures ?
Apparently these are due to issue in https://github.com/scylladb/scylladb/pull/15213 branch
this happened 5 times, since 10 Aug: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/98ae1610-4b1b-11ee-b96a-256b7e8b2261
I think it's enough time to take it out of the gate
@fruch - is it a recent addition to gating?
@fruch - is it a recent addition to gating?
it was added to the gating in the big batch added in a month ago: https://github.com/scylladb/scylla-dtest/pull/3452
@fruch - is it a recent addition to gating?
it was added to the gating in the big batch added in a month ago: scylladb/scylla-dtest#3452
So indeed a reasonable candidate to be taken out.
I looked a the test. My guess is that this is a (very minor) Scylla bug, not a test problem:
This test (test_populate_mv_after_insert_wide_rows) does:
The good news is that all of this passes successfully... The view gets built and read just fine. The only problem is that the test finds a message in the log:
ERROR 2023-08-14 07:32:54,996 [shard 0] view - Error applying view update to 127.0.53.3 (view: ks.t_by_v, base token: -4069959284402364209, view token: 8209703677368372043): exceptions::mutation_write_failure_exception (Operation failed for ks.t_by_v - received 0 responses and 1 failures from 1 CL=ONE.
This error must have been a transient error during the view building, as eventually we were able to read all the data so the view building must have finished.
The error mentions tokens, which can be translated to keys id=1 and v=34:
>>> from cassandra.metadata import Murmur3Token
>>> Murmur3Token.hash_fn(int.to_bytes(1, length=4))
-4069959284402364209
>>> for i in range(10000):
... if Murmur3Token.hash_fn(int.to_bytes(i, length=4)) == 8209703677368372043:
... print(i)
...
34
>>> Murmur3Token.hash_fn(int.to_bytes(34, length=4))
8209703677368372043
I don't know what is special in this particular row and can't find the Scylla log file (@fruch, can you help me) to see if there are any other error messages that happened during the view build. In particular, this error message is rate limited - so it might be the first of 100 messages (e.g. all ids between 0 and 100, an 34 just happened to be printed first). Or not, I don't know.
I suspect that the view building on node 1 experienced a transient problem which prevented the view update from being sent to the view replica on node 3 (note that because N=3, RF=1, view replicas are typically on a different node on this test). I can't be sure why it failed without finding the logs of node 3 (@fruch maybe you can help me), but I have a guess, that will need to proved or discredited:
View building starts on node 1 as soon as node 1 learns that the CREATE MATERIALIZED VIEW has happened. However, it is conceivable that when node 1 starts to send view updates, and decides it should send some of them to node 3, node 3 doesn't yet know about this view's existence so the write fails to node 3. The view building will later try again, and by that time node 3 will have received the updated schema, and the second write will succeed and the view building will proceed.
If this explanation is correct, we can modify the view building so that it doesn't start until all live nodes are aware of the new view. Perhaps waiting for schema agreement is the right way to go. Alternatively, we should consider not reporting errors during view build as "ERROR"s, since the view building retries anyway, and these should be considered INFO at most, if at all.
CC @cvybhu @eliransin
we don't have the log of all those occurrences any more, we don't persist all of the jobs for too long, and in dtest case we don't move them anywhere else.
@benipeled can we enlarge the dtest gating job history a bit ?
meanwhile there are similar tests with MVs that having similar failures here's one: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/testReport/junit/concurrent_schema_changes_test/TestConcurrentSchemaChanges/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_create_lots_of_mv_concurrently/
and a link to the dashboard show similar ones (i.e. that have the same error mentioned in thier failure): https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/557f8e00-4d96-11ee-b96a-256b7e8b2261
meanwhile there are similar tests with MVs that having similar failures here's one: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/testReport/junit/concurrent_schema_changes_test/TestConcurrentSchemaChanges/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_create_lots_of_mv_concurrently/
Thanks, this one indeed looks similar. This one also has an error in node1 trying to send view updates to node2. The node1 log https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/artifact/logs-full.release.010/1693892699459_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_create_lots_of_mv_concurrently/node1.log has:
INFO 2023-09-05 05:43:45,833 [shard 0:stre] schema_tables - Creating lots_o_views.src_by_c2 id=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d version=dae806bf-19ea-3f75-99c8-037d25edb1d6
INFO 2023-09-05 05:43:45,835 [shard 0:stre] view - Building view lots_o_views.src_by_c2, starting at token minimum token
ERROR 2023-09-05 05:43:45,836 [shard 0:stre] view - Error applying view update to 127.0.75.2 (view: lots_o_views.src_by_c2, base token: -8978909661410770097, view token: 1287679541648494871): exceptions::mutation_write_failure_exception (Operation failed for lots_o_views.src_by_c2 - received 0 responses and 1 failures from 1 CL=ONE.)
WARN 2023-09-05 05:43:45,836 [shard 0:stre] view - Error executing build step for base lots_o_views.source_data: exceptions::mutation_write_failure_exception (Operation failed for lots_o_views.src_by_c2 - received 0 responses and 1 failures from 1 CL=ONE.)
INFO 2023-09-05 05:43:47,016 [shard 0:stre] view - Finished building view lots_o_views.src_by_c2
As I assumed, the view building on node started 2 milliseconds after the MV was created on node 1, 2023-09-05 05:43:45,835
. Just a millisecond later, the first write to node 2 failed. Looking at the node 2 log https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/artifact/logs-full.release.010/1693892699459_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_create_lots_of_mv_concurrently/node2.log we see:
INFO 2023-09-05 05:43:45,829 [shard 0:stat] migration_manager - Create new view: org.apache.cassandra.config.CFMetaData@0x600001130400[cfId=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d,
INFO 2023-09-05 05:43:45,834 [shard 0:stre] schema_tables - Altering lots_o_views.source_data id=1fc41c10-4baf-11ee-908b-f711dd34e11d version=aaedc4a9-12ec-3263-b5d9-c91e3f2e11f9
INFO 2023-09-05 05:43:45,834 [shard 0:stre] schema_tables - Creating lots_o_views.src_by_c2 id=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d version=dae806bf-19ea-3f75-99c8-037d25edb1d6
WARN 2023-09-05 05:43:45,836 [shard 0:stat] storage_proxy - Failed to apply mutation from 127.0.75.1#0: data_dictionary::no_such_column_family (Can't find a column family with UUID 2e01bd50-4baf-11ee-988f-c6c02f6a2b6d)
I think this confirms my guess - node 2 receives view updates two milliseconds after learning of the view, and apparently it's not yet fully aware of this very recently-added table and fails to find it.
As this is a (minor) Scylla bug, not a dtest flakiness, I'll change the issue title. I'll include here the failed dtest names for searchability:
materialized_views_test.py
::TestMaterializedViews
::test_populate_mv_after_insert_wide_rows
concurrent_schema_changes_test.py
::TestConcurrentSchemaChanges
::test_create_lots_of_mv_concurrently
@nyh - the retry mechanism is based on hints, btw? I think it's very interesting to figure out the reason for the transient failure - maybe will give us a clue wrt inconsistent MVs.
I think I explained above the reason for the transient failure - node 1 started view building of pre-existing data, sent a view update to node 2 - just before node 2 fully learned about the view existing so it refuses the write. IIRC there are no hints involved here - view building remembers its position and what it needs to resend. Also note that the view is fully consistent at the end - the only "bug" here is the log spam.
I think I explained above the reason for the transient failure - node 1 started view building of pre-existing data, sent a view update to node 2 - just before node 2 fully learned about the view existing so it refuses the write. IIRC there are no hints involved here - view building remembers its position and what it needs to resend. Also note that the view is fully consistent at the end - the only "bug" here is the log spam.
Maybe in this particular case, but I'm concerned about not properly synchronizing with schema changes. See https://github.com/scylladb/scylladb/issues/15048#issuecomment-1677241939 (and the issues look duplicated)
if so @nyh, can change those log print into a warning ? so we won't need to filter them all over the place ?
@fruch - remove from gating please. Lets continue the dissection and analysis in parallel.
@fruch - remove from gating please. Lets continue the dissection and analysis in parallel.
it's all the materialized view tests, that are gonna be taken out of gating, what mentioned here can happen on any of them
@fruch - remove from gating please. Lets continue the dissection and analysis in parallel.
it's all the materialized view tests, that are gonna be taken out of gating, what mentioned here can happen on any of them
We haven't had those failures before - or have we? We haven't changed the logging, to the best of my knowledge. If it's a benign issue - even better, we can fix it on the side branch, then re-apply either the test, the fix or both and get them back into gating. Right now, it's (yet another) source for gating flakiness.
we don't have the log of all those occurrences any more, we don't persist all of the jobs for too long, and in dtest case we don't move them anywhere else.
@benipeled can we enlarge the dtest gating job history a bit ?
Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60
we don't have the log of all those occurrences any more, we don't persist all of the jobs for too long, and in dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit ?
Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60
@benipeled and the jobs themself ? I think in the rate we have for it, they doesn't last long https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more jobs history on that one, I would say 3 months.
@benipeled
https://github.com/scylladb/scylla-dtest/pull/3567 it was by number of jobs, and the day for the artifacts doesn't really helps, since we keep the links to the job, and not to specific artifacts.
We haven't had those failures before - or have we? We haven't changed the logging, to the best of my knowledge. If it's a benign issue - even better, we can fix it on the side branch, then re-apply either the test, the fix or both and get them back into gating. Right now, it's (yet another) source for gating flakiness.
I think it's perhaps not a common occurrence, the view build on the first node races with the second node learning that the view exists. Perhaps the schema-changing usually wins the race. I don't know why it has become (if at all) more common than it was in the past. Moreover, this problem only applies to tests that do view building (adding a view to a table with pre-existing data) - many MV tests add data to the view after it's created and are not vulnerable to this issue.
we don't have the log of all those occurrences any more, we don't persist all of the jobs for too long, and in dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit ?
Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60
@benipeled and the jobs themself ? I think in the rate we have for it, they doesn't last long
We can keep the job themself for long time but it means console only which is not enough
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more jobs history on that one, I would say 3 months.
What about this build? it has all artifacts
we don't have the log of all those occurrences any more, we don't persist all of the jobs for too long, and in dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit ?
Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60
@benipeled and the jobs themself ? I think in the rate we have for it, they doesn't last long
We can keep the job themself for long time but it means console only which is not enough
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more jobs history on that one, I would say 3 months.
What about this build? it has all artifacts
yeah, when there no other errors causing it not to have it :)
@fruch - did we take this out of gating? @nyh - is it reasonable to change this ERROR to WARN?
@fruch - did we take this out of gating? @nyh - is it reasonable to change this ERROR to WARN?
yes, cause of those kind of issues, all of those test we taken out of gating https://github.com/scylladb/scylla-dtest/commit/17f9d015868ffb6fb27595099ce63dcc502832c7
we are seeing this a lot in SCT as well:
2023-09-30 06:39:50.379 <2023-09-30 06:39:50.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=04f44e1e-ae85-4253-9ac6-f6f3556af219: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=6245 node=longevity-large-partitions-200k-pks-db-node-d62e8bac-0-3
2023-09-30T06:39:50+00:00 longevity-large-partitions-200k-pks-db-node-d62e8bac-0-3 !ERR | scylla[6843]: [shard 11:stre] view - (rate limiting dropped 26 similar messages) Error applying view update to 10.142.0.111 (view: scylla_bench.view_test, base token: 3520969294134539810, view token: 8659755778285643669): exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
repeating @mykaul, can we change those print level ?, or we need to a add code to ignore all over the place ?
I hundreds of those errors prints in a case which mostly do topology changes on a multi-dc cluster:
2023-10-02 17:01:44.225 <2023-10-02 17:01:44.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=ec00c543-bb0f-484d-9800-bc383f0c5f15: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2187 node=parallel-topology-schema-changes-mu-db-node-7162315c-3
2023-10-02T17:01:44+00:00 parallel-topology-schema-changes-mu-db-node-7162315c-3 !ERR | scylla[5623]: [shard 0:stat] view - Error applying view update to 10.4.9.255 (view: keyspace1.standard1_c3_nemesis_index, base token: 8687565012191682492, view token: -783736015806138877): data_dictionary::no_such_column_family (Can't find a column family with UUID 5d06aba1-6145-11ee-9092-ffeb434a7910)
when should we expect those error ?, it's not just on index creation, and seems like it might happen any time a new is added. or a node is stopped.
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20231002.1640f83fdc31
with build-id d751f4ab981bda6f045bb902f0d341e61f0ac3a7
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0000f82a29037b49d ami-0205476ccb2b99643
(aws: undefined_region)
Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 7162315c-e5dc-41a6-940d-13e7997f7162
Test name: scylla-master/longevity/longevity-multidc-schema-topology-changes-12h-test
Test config file(s):
I think we can change the log level to WARN
, a failing view update isn't a fatal event.
Opened: https://github.com/scylladb/scylladb/pull/15640
@nyh view updates should be signed with schema version just like normal writes. And schema version fencing should work. Why doesn't it?
@nyh Do you think it would be enough to change the log level to WARN? or do we need to investigate what @kostja recommended, to find out why version fencing didnt work this time?
I don't know what Kostja meant about schema versions and fencing (he may be right, but I'm just not experienced enough in this area), so I'll use simpler words:
It is pointless for any node to start its view building process before it is sure that all nodes know of the new view, because this node will surely need to send updates to every other nodes - and it will fail if the other nodes don't know on the view building.
So I think the real fix is to delay the start of view building a bit - until schema agreement, and I think we should keep this open until we do this.
I don't know what is the right "process" to wait until all the nodes know of the new view, and whether we should wait for every single node (in any case, view building is likely to pause if it can't reach one of the nodes, so the view building process does depend on no node being down).
@fruch - remove from gating please. Lets continue the dissection and analysis in parallel.
it's all the materialized view tests, that are gonna be taken out of gating, what mentioned here can happen on any of them
I missed that this was done. In my analysis above I thought that only a couple of specific tests had this problem - not all of them... So,... Now it turns out for a few months (since September), all of the materialized views tests have been off gating. That's not good. So I guess we need to fix this issue once and for all to get the tests back to gating! I see that @kostja assigned this issue to @wmitros?
In comment https://github.com/scylladb/scylladb/issues/15046#issuecomment-1932642793 I suggested what I think the "best" thing to do would be - to delay view building, but perhaps a simpler and good-enough fix would be to downgrade the ERROR to WARN, considering that the view building is already coded to ignore these errors and try again (which is why it works correctly, besides printing those scary errors).
Saw in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3045/testReport/junit/materialized_views_test/TestMaterializedViews/Tests___Sanity_Tests___test_populate_mv_after_insert_wide_rows/
I couldn't reproduce this locally.
@fruch is this one of the new tests, added to gating?