scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

View build starts before other nodes are aware of the new view, resulting in ERROR-level log spam #15046

Open denesb opened 1 year ago

denesb commented 1 year ago

Saw in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3045/testReport/junit/materialized_views_test/TestMaterializedViews/Tests___Sanity_Tests___test_populate_mv_after_insert_wide_rows/

I couldn't reproduce this locally.

@fruch is this one of the new tests added to gating?

mykaul commented 1 year ago

@fruch ?

bhalevy commented 1 year ago

The history of this test suggests it was failing with this for a while, and now it's failing with a different error:

>           raise AssertionError('Critical errors found: {}\nOther errors: {}'.format(
                critical_errors, found_errors))
E           AssertionError: Critical errors found: [('node1', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.']), ('node2', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.']), ('node3', ['scylla: seastar/include/seastar/core/gate.hh:144: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"\' failed.', 'Aborting on shard 0.'])]
E           Other errors: []

@bhalevy are you aware of these types of failures?

Apparently these are due to an issue in the https://github.com/scylladb/scylladb/pull/15213 branch.

fruch commented 1 year ago

this happened 5 times since 10 Aug: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/98ae1610-4b1b-11ee-b96a-256b7e8b2261

I think that's enough to take it out of gating.

mykaul commented 1 year ago

@fruch - is it a recent addition to gating?

fruch commented 1 year ago

@fruch - is it a recent addition to gating?

it was added to gating in the big batch added about a month ago: https://github.com/scylladb/scylla-dtest/pull/3452

mykaul commented 1 year ago

@fruch - is it a recent addition to gating?

it was added to gating in the big batch added about a month ago: scylladb/scylla-dtest#3452

So indeed a reasonable candidate to be taken out.

nyh commented 1 year ago

I looked at the test. My guess is that this is a (very minor) Scylla bug, not a test problem:

This test (test_populate_mv_after_insert_wide_rows) does:

  1. It creates a cluster with 3 nodes, RF=1.
  2. It then builds a table with 5 partitions of 1000 rows each.
  3. It then adds a view to this table, waits for it to be built.
  4. Then, for each of the 5000 rows, it reads the row from the view, asserting it's there (see the sketch right after this list).
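
A minimal sketch of that flow, using the Python cassandra driver, could look like the following (the view name ks.t_by_v appears in the error message quoted below; the base-table schema, keyspace options and helper code here are assumptions for illustration, not the actual dtest code):

```python
from cassandra.cluster import Cluster

# Connect to the 3-node test cluster; the keyspace uses RF=1.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("CREATE KEYSPACE ks WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE ks.t (id int, v int, PRIMARY KEY (id, v))")

# Step 2: 5 partitions of 1000 rows each.
insert = session.prepare("INSERT INTO ks.t (id, v) VALUES (?, ?)")
for pk in range(5):
    for ck in range(1000):
        session.execute(insert, (pk, ck))

# Step 3: add a view over the pre-existing data, so the view builder has real
# work to do, then wait until the build is reported as finished (the dtest
# polls system tables for this).
session.execute("CREATE MATERIALIZED VIEW ks.t_by_v AS "
                "SELECT * FROM ks.t WHERE id IS NOT NULL AND v IS NOT NULL "
                "PRIMARY KEY (v, id)")

# Step 4: read each of the 5000 rows back through the view and assert it's there.
select = session.prepare("SELECT id FROM ks.t_by_v WHERE v = ? AND id = ?")
for pk in range(5):
    for ck in range(1000):
        assert list(session.execute(select, (ck, pk)))
```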

The good news is that all of this passes successfully... The view gets built and read just fine. The only problem is that the test finds a message in the log:

ERROR 2023-08-14 07:32:54,996 [shard 0] view - Error applying view update to 127.0.53.3 (view: ks.t_by_v, base token: -4069959284402364209, view token: 8209703677368372043): exceptions::mutation_write_failure_exception (Operation failed for ks.t_by_v - received 0 responses and 1 failures from 1 CL=ONE.

This error must have been a transient error during the view building, as eventually we were able to read all the data so the view building must have finished.

The error mentions tokens, which can be translated to keys id=1 and v=34:

>>> from cassandra.metadata import Murmur3Token
>>> Murmur3Token.hash_fn(int.to_bytes(1, length=4))
-4069959284402364209

>>> for i in range(10000):
...     if Murmur3Token.hash_fn(int.to_bytes(i, length=4)) == 8209703677368372043:
...         print(i)
... 
34
>>> Murmur3Token.hash_fn(int.to_bytes(34, length=4))
8209703677368372043

I don't know what is special in this particular row and can't find the Scylla log file (@fruch, can you help me?) to see if there are any other error messages that happened during the view build. In particular, this error message is rate limited - so it might be the first of 100 messages (e.g. all ids between 0 and 100, and 34 just happened to be printed first). Or not, I don't know.

I suspect that the view building on node 1 experienced a transient problem which prevented the view update from being sent to the view replica on node 3 (note that because N=3, RF=1, view replicas are typically on a different node in this test). I can't be sure why it failed without finding the logs of node 3 (@fruch maybe you can help me), but I have a guess that will need to be proved or discredited:

View building starts on node 1 as soon as node 1 learns that the CREATE MATERIALIZED VIEW has happened. However, it is conceivable that when node 1 starts to send view updates, and decides it should send some of them to node 3, node 3 doesn't yet know about this view's existence, so the write to node 3 fails. The view building will later try again, and by that time node 3 will have received the updated schema, so the second write will succeed and the view building will proceed.

If this explanation is correct, we can modify the view building so that it doesn't start until all live nodes are aware of the new view. Perhaps waiting for schema agreement is the right way to go. Alternatively, we should consider not reporting errors during view build as "ERROR"s, since the view building retries anyway, and these should be considered INFO at most, if at all.
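
As an aside, "schema agreement" here is the same check client drivers already perform after a DDL statement. A rough client-side illustration of the idea uses the Python driver's ControlConnection.wait_for_schema_agreement(); the fix suggested above would be an equivalent wait inside Scylla's view builder before it starts sending view updates, so this snippet is only an analogy, not that fix:

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# The schema change that triggers view building on every node.
session.execute("CREATE MATERIALIZED VIEW ks.t_by_v AS "
                "SELECT * FROM ks.t WHERE id IS NOT NULL AND v IS NOT NULL "
                "PRIMARY KEY (v, id)")

# Block until all live nodes report the same schema version, i.e. until no node
# is still unaware of the new view. The driver already does a similar wait after
# DDL by default (bounded by max_schema_agreement_wait).
if not cluster.control_connection.wait_for_schema_agreement():
    raise RuntimeError("schema agreement was not reached within the driver's timeout")
```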

CC @cvybhu @eliransin

fruch commented 1 year ago

we don't have the logs of all those occurrences any more, we don't persist all of the jobs for very long, and in the dtest case we don't move them anywhere else.

@benipeled can we enlarge the dtest gating job history a bit?

meanwhile there are similar tests with MVs that have similar failures; here's one: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/testReport/junit/concurrent_schema_changes_test/TestConcurrentSchemaChanges/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_create_lots_of_mv_concurrently/

and a link to the dashboard showing similar ones (i.e. that have the same error mentioned in their failure): https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/557f8e00-4d96-11ee-b96a-256b7e8b2261

nyh commented 1 year ago

meanwhile there are similar tests with MVs that have similar failures; here's one: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/testReport/junit/concurrent_schema_changes_test/TestConcurrentSchemaChanges/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_create_lots_of_mv_concurrently/

Thanks, this one indeed looks similar. This one also has an error in node1 trying to send view updates to node2. The node1 log https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/artifact/logs-full.release.010/1693892699459_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_create_lots_of_mv_concurrently/node1.log has:

INFO  2023-09-05 05:43:45,833 [shard 0:stre] schema_tables - Creating lots_o_views.src_by_c2 id=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d version=dae806bf-19ea-3f75-99c8-037d25edb1d6
INFO  2023-09-05 05:43:45,835 [shard 0:stre] view - Building view lots_o_views.src_by_c2, starting at token minimum token
ERROR 2023-09-05 05:43:45,836 [shard 0:stre] view - Error applying view update to 127.0.75.2 (view: lots_o_views.src_by_c2, base token: -8978909661410770097, view token: 1287679541648494871): exceptions::mutation_write_failure_exception (Operation failed for lots_o_views.src_by_c2 - received 0 responses and 1 failures from 1 CL=ONE.)
WARN  2023-09-05 05:43:45,836 [shard 0:stre] view - Error executing build step for base lots_o_views.source_data: exceptions::mutation_write_failure_exception (Operation failed for lots_o_views.src_by_c2 - received 0 responses and 1 failures from 1 CL=ONE.)
INFO  2023-09-05 05:43:47,016 [shard 0:stre] view - Finished building view lots_o_views.src_by_c2

As I assumed, the view building on node 1 started 2 milliseconds after the MV was created on node 1, at 2023-09-05 05:43:45,835. Just a millisecond later, the first write to node 2 failed. Looking at the node 2 log https://jenkins.scylladb.com/job/scylla-master/job/dtest-release-with-raft/245/artifact/logs-full.release.010/1693892699459_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_create_lots_of_mv_concurrently/node2.log we see:

INFO  2023-09-05 05:43:45,829 [shard 0:stat] migration_manager - Create new view: org.apache.cassandra.config.CFMetaData@0x600001130400[cfId=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d,
INFO  2023-09-05 05:43:45,834 [shard 0:stre] schema_tables - Altering lots_o_views.source_data id=1fc41c10-4baf-11ee-908b-f711dd34e11d version=aaedc4a9-12ec-3263-b5d9-c91e3f2e11f9
INFO  2023-09-05 05:43:45,834 [shard 0:stre] schema_tables - Creating lots_o_views.src_by_c2 id=2e01bd50-4baf-11ee-988f-c6c02f6a2b6d version=dae806bf-19ea-3f75-99c8-037d25edb1d6
WARN  2023-09-05 05:43:45,836 [shard 0:stat] storage_proxy - Failed to apply mutation from 127.0.75.1#0: data_dictionary::no_such_column_family (Can't find a column family with UUID 2e01bd50-4baf-11ee-988f-c6c02f6a2b6d)

I think this confirms my guess - node 2 receives view updates two milliseconds after learning of the view, and apparently it's not yet fully aware of this very recently-added table and fails to find it.

nyh commented 1 year ago

As this is a (minor) Scylla bug, not dtest flakiness, I'll change the issue title. I'll include here the failed dtest names for searchability:

mykaul commented 1 year ago

@nyh - the retry mechanism is based on hints, btw? I think it's very interesting to figure out the reason for the transient failure - maybe it will give us a clue wrt inconsistent MVs.

nyh commented 1 year ago

I think I explained above the reason for the transient failure - node 1 started view building of pre-existing data, sent a view update to node 2 - just before node 2 fully learned about the view existing so it refuses the write. IIRC there are no hints involved here - view building remembers its position and what it needs to resend. Also note that the view is fully consistent at the end - the only "bug" here is the log spam.

bhalevy commented 1 year ago

I think I explained above the reason for the transient failure - node 1 started view building of pre-existing data, sent a view update to node 2 - just before node 2 fully learned about the view existing so it refuses the write. IIRC there are no hints involved here - view building remembers its position and what it needs to resend. Also note that the view is fully consistent at the end - the only "bug" here is the log spam.

Maybe in this particular case, but I'm concerned about not properly synchronizing with schema changes. See https://github.com/scylladb/scylladb/issues/15048#issuecomment-1677241939 (and the issues look like duplicates)

fruch commented 1 year ago

if so @nyh, can we change those log prints into a warning, so we won't need to filter them all over the place?

mykaul commented 1 year ago

@fruch - remove from gating please. Let's continue the dissection and analysis in parallel.

fruch commented 1 year ago

@fruch - remove from gating please. Let's continue the dissection and analysis in parallel.

it's all the materialized view tests that are gonna be taken out of gating; what's mentioned here can happen on any of them

mykaul commented 1 year ago

@fruch - remove from gating please. Let's continue the dissection and analysis in parallel.

it's all the materialized view tests that are gonna be taken out of gating; what's mentioned here can happen on any of them

We haven't had those failures before - or have we? We haven't changed the logging, to the best of my knowledge. If it's a benign issue - even better, we can fix it on the side branch, then re-apply either the test, the fix or both and get them back into gating. Right now, it's (yet another) source for gating flakiness.

benipeled commented 1 year ago

we don't have the logs of all those occurrences any more, we don't persist all of the jobs for very long, and in the dtest case we don't move them anywhere else.

@benipeled can we enlarge the dtest gating job history a bit?

Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60

fruch commented 1 year ago

we don't have the logs of all those occurrences any more, we don't persist all of the jobs for very long, and in the dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit?

Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60

@benipeled and the jobs themselves? I think at the rate we have for it, they don't last long: https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more job history on that one, I would say 3 months.

fruch commented 1 year ago

@benipeled

https://github.com/scylladb/scylla-dtest/pull/3567 - it was by number of jobs, and the days for the artifacts don't really help, since we keep the links to the job, and not to specific artifacts.

nyh commented 1 year ago

We haven't had those failures before - or have we? We haven't changed the logging, to the best of my knowledge. If it's a benign issue - even better, we can fix it on the side branch, then re-apply either the test, the fix or both and get them back into gating. Right now, it's (yet another) source for gating flakiness.

I think it's perhaps not a common occurrence: the view build on the first node races with the second node learning that the view exists. Perhaps the schema change usually wins the race. I don't know why it has become (if at all) more common than it was in the past. Moreover, this problem only applies to tests that do view building (adding a view to a table with pre-existing data) - many MV tests add data to the view after it's created and are not vulnerable to this issue.

benipeled commented 1 year ago

we don't have the logs of all those occurrences any more, we don't persist all of the jobs for very long, and in the dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit?

Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60

@benipeled and the jobs themselves? I think at the rate we have for it, they don't last long

We can keep the jobs themselves for a long time, but it means console only, which is not enough

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more job history on that one, I would say 3 months.

What about this build? it has all artifacts

fruch commented 1 year ago

we don't have the logs of all those occurrences any more, we don't persist all of the jobs for very long, and in the dtest case we don't move them anywhere else. @benipeled can we enlarge the dtest gating job history a bit?

Artifacts archived for 240 days - should be enough in this case but feel free to adjust it as needed if you think it's required https://github.com/scylladb/scylla-dtest/blob/next/vars/dtestPipeline.groovy#L59-L60

@benipeled and the jobs themselves? I think at the rate we have for it, they don't last long

We can keep the jobs themselves for a long time, but it means console only, which is not enough

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/4176/ is from 9 days ago. we need much more job history on that one, I would say 3 months.

What about this build? it has all artifacts

yeah, when there are no other errors causing it not to have them :)

mykaul commented 1 year ago

@fruch - did we take this out of gating? @nyh - is it reasonable to change this ERROR to WARN?

fruch commented 1 year ago

@fruch - did we take this out of gating? @nyh - is it reasonable to change this ERROR to WARN?

yes, because of those kinds of issues, all of those tests were taken out of gating: https://github.com/scylladb/scylla-dtest/commit/17f9d015868ffb6fb27595099ce63dcc502832c7

fruch commented 1 year ago

we are seeing this a lot in SCT as well:

2023-09-30 06:39:50.379 <2023-09-30 06:39:50.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=04f44e1e-ae85-4253-9ac6-f6f3556af219: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=6245 node=longevity-large-partitions-200k-pks-db-node-d62e8bac-0-3
2023-09-30T06:39:50+00:00 longevity-large-partitions-200k-pks-db-node-d62e8bac-0-3      !ERR | scylla[6843]:  [shard 11:stre] view - (rate limiting dropped 26 similar messages) Error applying view update to 10.142.0.111 (view: scylla_bench.view_test, base token: 3520969294134539810, view token: 8659755778285643669): exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)

repeating @mykaul: can we change the level of those prints, or do we need to add code to ignore them all over the place?

fruch commented 1 year ago

I see hundreds of those error prints in a case which mostly does topology changes on a multi-dc cluster:

2023-10-02 17:01:44.225 <2023-10-02 17:01:44.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=ec00c543-bb0f-484d-9800-bc383f0c5f15: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2187 node=parallel-topology-schema-changes-mu-db-node-7162315c-3
2023-10-02T17:01:44+00:00 parallel-topology-schema-changes-mu-db-node-7162315c-3      !ERR | scylla[5623]:  [shard 0:stat] view - Error applying view update to 10.4.9.255 (view: keyspace1.standard1_c3_nemesis_index, base token: 8687565012191682492, view token: -783736015806138877): data_dictionary::no_such_column_family (Can't find a column family with UUID 5d06aba1-6145-11ee-9092-ffeb434a7910)

when should we expect those errors? It's not just on index creation; it seems like it might happen any time a new node is added, or a node is stopped.

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20231002.1640f83fdc31 with build-id d751f4ab981bda6f045bb902f0d341e61f0ac3a7

Cluster size: 12 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0000f82a29037b49d ami-0205476ccb2b99643 (aws: undefined_region)

Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 7162315c-e5dc-41a6-940d-13e7997f7162
Test name: scylla-master/longevity/longevity-multidc-schema-topology-changes-12h-test
Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor 7162315c-e5dc-41a6-940d-13e7997f7162`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7162315c-e5dc-41a6-940d-13e7997f7162)
- Show all stored logs command: `$ hydra investigate show-logs 7162315c-e5dc-41a6-940d-13e7997f7162`

Logs:
- **db-cluster-7162315c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7162315c-e5dc-41a6-940d-13e7997f7162/20231003_171218/db-cluster-7162315c.tar.gz
- **sct-runner-events-7162315c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7162315c-e5dc-41a6-940d-13e7997f7162/20231003_171218/sct-runner-events-7162315c.tar.gz
- **sct-7162315c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7162315c-e5dc-41a6-940d-13e7997f7162/20231003_171218/sct-7162315c.log.tar.gz
- **loader-set-7162315c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7162315c-e5dc-41a6-940d-13e7997f7162/20231003_171218/loader-set-7162315c.tar.gz
- **monitor-set-7162315c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7162315c-e5dc-41a6-940d-13e7997f7162/20231003_171218/monitor-set-7162315c.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-multidc-schema-topology-changes-12h-test/15/)
[Argus](https://argus.scylladb.com/test/4d9b162e-c35a-41e6-a0ac-d677a64c26ac/runs?additionalRuns[]=7162315c-e5dc-41a6-940d-13e7997f7162)

cvybhu commented 1 year ago

I think we can change the log level to WARN; a failing view update isn't a fatal event. Opened: https://github.com/scylladb/scylladb/pull/15640

kostja commented 11 months ago

@nyh view updates should be signed with schema version just like normal writes. And schema version fencing should work. Why doesn't it?

dani-tweig commented 8 months ago

@nyh Do you think it would be enough to change the log level to WARN? Or do we need to investigate what @kostja recommended, to find out why version fencing didn't work this time?

nyh commented 8 months ago

I don't know what Kostja meant about schema versions and fencing (he may be right, but I'm just not experienced enough in this area), so I'll use simpler words:

It is pointless for any node to start its view building process before it is sure that all nodes know of the new view, because this node will surely need to send updates to every other node - and those updates will fail if the other nodes don't know about the view.

So I think the real fix is to delay the start of view building a bit - until schema agreement, and I think we should keep this open until we do this.

I don't know what the right "process" is to wait until all the nodes know of the new view, or whether we should wait for every single node (in any case, view building is likely to pause if it can't reach one of the nodes, so the view building process does depend on no node being down).

nyh commented 7 months ago

@fruch - remove from gating please. Let's continue the dissection and analysis in parallel.

it's all the materialized view tests that are gonna be taken out of gating; what's mentioned here can happen on any of them

I missed that this was done. In my analysis above I thought that only a couple of specific tests had this problem, not all of them... So now it turns out that for a few months (since September), all of the materialized view tests have been off gating. That's not good. So I guess we need to fix this issue once and for all to get the tests back into gating! I see that @kostja assigned this issue to @wmitros?

In comment https://github.com/scylladb/scylladb/issues/15046#issuecomment-1932642793 I suggested what I think the "best" thing to do would be - to delay view building, but perhaps a simpler and good-enough fix would be to downgrade the ERROR to WARN, considering that the view building is already coded to ignore these errors and try again (which is why it works correctly, besides printing those scary errors).