scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

tablets: Shard bouncing should not assume that shards don't change #15465

Open. tgrabiec opened this issue 10 months ago.

tgrabiec commented 10 months ago

Shard bouncing (used for LWT requests) is handled like this:


template<typename Process>
future<cql_server::result_with_foreign_response_ptr>
cql_server::connection::process_on_shard(::shared_ptr<messages::result_message::bounce_to_shard> bounce_msg, uint16_t stream, fragmented_temporary_buffer::istream is,
        service::client_state& cs, service_permit permit, tracing::trace_state_ptr trace_state, Process process_fn) {
    return _server.container().invoke_on(*bounce_msg->move_to_shard(), _server._config.bounce_request_smp_service_group,
            [this, is = std::move(is), cs = cs.move_to_other_shard(), stream, permit = std::move(permit), process_fn,
             gt = tracing::global_trace_state_ptr(std::move(trace_state)),
             cached_vals = std::move(bounce_msg->take_cached_pk_function_calls())] (cql_server& server) {
        service::client_state client_state = cs.get();
        return do_with(bytes_ostream(), std::move(client_state), std::move(cached_vals),
                [this, &server, is = std::move(is), stream, process_fn,
                 trace_state = tracing::trace_state_ptr(gt)] (bytes_ostream& linearization_buffer,
                    service::client_state& client_state,
                    cql3::computed_function_values& cached_vals) mutable {
            request_reader in(is, linearization_buffer);
            return process_fn(client_state, server._query_processor, in, stream, _version,
                    /* FIXME */empty_service_permit(), std::move(trace_state), false, std::move(cached_vals)).then([] (auto msg) {
                // result here has to be foreign ptr
                return std::get<cql_server::result_with_foreign_response_ptr>(std::move(msg));
            });
        });
    });
}

It assumes that the shard we got in the bounce response will still be the correct shard the second time around:

                // result here has to be foreign ptr
                return std::get<cql_server::result_with_foreign_response_ptr>(std::move(msg));

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.
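
To make the failure mode concrete, here is a minimal standalone sketch (toy types, not ScyllaDB code) of why the std::get above throws when the tablet moves again:

#include <variant>

// Toy stand-ins for the real message types: the result of processing on
// the target shard is a variant that can hold either a finished response
// or another bounce request.
struct response {};                          // stands in for result_with_foreign_response_ptr
struct bounce_to_shard { unsigned shard; };  // stands in for messages::result_message::bounce_to_shard
using process_result = std::variant<response, bounce_to_shard>;

int main() {
    process_result msg = bounce_to_shard{2};         // tablet migrated again
    try {
        auto r = std::get<response>(std::move(msg)); // mirrors the std::get above
        (void)r;
    } catch (const std::bad_variant_access&) {
        // this is the error described in this issue
    }
}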

gleb-cloudius commented 4 months ago

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.

Did you see it happening? Isn't migration supposed to wait for all ongoing requests to complete before moving? And last but not least, do we support moving a tablet from shard to shard on the same node?

tgrabiec commented 4 months ago

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.

Did you see it happening?

No.

Isn't migration supposed to wait for all ongoing requests to complete before moving?

This jumping happens in the CQL layer, where we don't hold an erm (effective_replication_map_ptr) around it, so it can escape barriers.

And last but not least, do we support moving a tablet from shard to shard on the same node?

Not yet, but we will. But even now, it's not guaranteed that the CQL coordinator runs on the tablet-owning node, so a tablet can migrate onto the coordinator node.

gleb-cloudius commented 4 months ago

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.

Isn't migration supposed to wait for all ongoing requests to complete before moving?

This jumping happens in the CQL layer, where we don't hold an erm around it, so it can escape barriers.

That's true, but we can fix it. For instance, we can put the erm into the bounce error struct.
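
A rough sketch of that shape (all names illustrative, not actual ScyllaDB declarations):

#include <memory>

// Sketch only: the bounce message carries the erm that was used to pick
// the target shard. Migration waits for all references to the old
// topology version to drop, so it cannot complete while the bounced
// request is still in flight. Caveat (raised further down in this
// thread): erm pointers are shard-local, so actually shipping one to
// another shard needs plumbing that does not exist yet.
struct effective_replication_map {};                              // stand-in
using erm_ptr = std::shared_ptr<const effective_replication_map>; // stand-in

struct bounce_to_shard_msg {
    unsigned target_shard;
    erm_ptr erm;  // pins the topology version until the message is destroyed
};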

And last but not least, do we support moving a tablet from shard to shard on the same node?

Not yet, but we will. But even now, it's not guaranteed that the CQL coordinator runs on the tablet-owning node, so a tablet can migrate onto the coordinator node.

We run drain on all nodes.

nyh commented 4 months ago

Unfortunately, Alternator may have exactly the same problem as CQL, because it (see alternator/executor.cc) also uses cas_shard() and forwards the request to that shard. So whatever solution is used for CQL (I see there's an open PR #17309 by @bhalevy that claims to fix this issue), we'll need to do it for Alternator as well. I'll open a separate issue for that.

nyh commented 4 months ago

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.

What if not only the shard changes, but the tablet was moved off this node entirely, so the node doesn't own it any more? What will happen then? In the past, I believed (maybe I misunderstood?) that it was fine to run an LWT request on a coordinator node that doesn't own a replica of the data, as long as it was run on the correct shard (the "correct" shard was a function of the token, even if this node had no replica for this token). Is this no longer true? Is it now forbidden to run an LWT request on a coordinator node that doesn't contain a replica of the involved tablet?

nyh commented 4 months ago

To emphasize my last question: Is it possible that with tablets, non-topology-aware drivers (which send the request to a random node, not one known to host the relevant tablet) fail with LWT? I opened an Alternator version of this issue in https://github.com/scylladb/scylladb/issues/17399. In Alternator, the question of non-topology-aware drivers is even more critical, because Alternator is never topology-aware: requests arrive at arbitrary nodes, not necessarily the right nodes (let alone the right shards).

tgrabiec commented 4 months ago

To emphasize my last question: Is it possible that with tablets, non-topology-aware drivers (which send the request to a random node, not one known to host the relevant tablet) fail with LWT? I opened an Alternator version of this issue in #17399. In Alternator, the question of non-topology-aware drivers is even more critical, because Alternator is never topology-aware: requests arrive at arbitrary nodes, not necessarily the right nodes (let alone the right shards).

We develop with the goal in mind that drivers don't have to be tablet-aware for correctness, only performance.

Whether LWT depends on it for correctness, I'm not sure. Why do we require the LWT coordinator to run on the key-owning shard, @gleb-cloudius?

tgrabiec commented 4 months ago

But with tablets, the shard may change in between if the tablet is migrated. This can cause std::bad_variant_access to be thrown.

Isn't migration supposed to wait for all ongoing requests to complete before moving?

This jumping happens in the CQL layer, where we don't hold an erm around it, so it can escape barriers.

That's true, but we can fix it. For instance, we can put the erm into the bounce error struct.

True, but there's no mechanism for passing an erm pointer across shards yet. Maybe Benny's approach of retrying in this rare event is good enough?
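
A sketch of what that retry could look like (hypothetical names, not the actual code in the PRs):

#include <stdexcept>
#include <variant>

// Instead of assuming the target shard must produce a final response,
// inspect the result; if it is another bounce (the tablet migrated while
// the request was hopping), follow it, up to a small safety limit.
struct response {};
struct bounce_to_shard { unsigned shard; };
using process_result = std::variant<response, bounce_to_shard>;

// Stub standing in for re-running the processing function on a shard.
process_result process_on(unsigned shard) { (void)shard; return response{}; }

response process_with_rebounce(unsigned shard) {
    constexpr int max_bounces = 8;  // arbitrary safety limit
    for (int i = 0; i < max_bounces; ++i) {
        auto msg = process_on(shard);
        if (auto* b = std::get_if<bounce_to_shard>(&msg)) {
            shard = b->shard;       // tablet moved again; follow it
            continue;
        }
        return std::get<response>(std::move(msg));
    }
    throw std::runtime_error("too many shard bounces");
}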

And last but not least, do we support moving a tablet from shard to shard on the same node?

Not yet, but we will. But even now, it's not guaranteed that the CQL coordinator runs on the tablet-owning node, so a tablet can migrate onto the coordinator node.

We run drain on all nodes.

That would help if we held the erm pointer in the CQL layer around the whole operation.

gleb-cloudius commented 4 months ago

Why do we require the LWT coordinator to run on the key-owning shard?

We introduced bounce for efficiency, but once we had bounce we could guarantee that the storage proxy code runs on the correct shard. I cannot guarantee that it will run correctly (though less efficiently) if executed on a different shard (hence the check). There was no such requirement when the code was written, and there were no tests running in such a setup. FWIW, I think the thing that should be fixed is moving the tablet locking up. Tablets have to try hard to preserve the "same shard" property for data, for performance.
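
To illustrate the difference (toy code, not ScyllaDB's actual sharders): with vnodes the owning shard is a pure function of the token, so it can never change for a given key; with tablets it is a property of the tablet's current placement, which a migration can change.

#include <cstdint>
#include <unordered_map>

// vnodes: shard is derived from the token alone, so it is stable forever.
unsigned vnode_shard(uint64_t token, unsigned shard_count) {
    return token % shard_count;  // toy function; the real mapping differs
}

// tablets: shard is looked up in mutable placement state.
struct tablet_map {
    uint64_t tablet_count;
    std::unordered_map<uint64_t, unsigned> placement;  // tablet -> shard
    unsigned shard_of(uint64_t token) const {
        return placement.at(token % tablet_count);     // changes on migration
    }
};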

avikivity commented 4 months ago

As an alternative to checking that shards don't change, we can hold an effective_replication_map_ptr and thereby prevent migrations.
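
A minimal sketch of that alternative (hypothetical names, not the real API):

#include <cstdint>
#include <memory>

// Acquire the erm before computing the target shard and hold it across
// the whole cross-shard hop. Migration waits for every reference to the
// old topology version to drop, so the shard cannot change underneath
// the request.
struct effective_replication_map {
    unsigned shard_of(uint64_t token) const { return token % 4; }  // toy sharding
};
using erm_ptr = std::shared_ptr<const effective_replication_map>;

struct table {
    erm_ptr get_erm() const { return std::make_shared<effective_replication_map>(); }
};

unsigned execute_pinned(const table& t, uint64_t token) {
    erm_ptr erm = t.get_erm();              // pin: migration now waits on us
    unsigned shard = erm->shard_of(token);
    // ... invoke_on(shard, ...) and process the request while erm is held ...
    return shard;                           // erm released here; migration may proceed
}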

mykaul commented 3 months ago

Is this still a work item for 6.0?

tgrabiec commented 3 months ago

We decided not to support LWT with tablets in 6.0, so this is not necessary.

mykaul commented 3 months ago

@kostja ^^^

kostja commented 3 months ago

@tgrabiec the title has no mention of LWT; there is no test case; and the code that is in master today has no check for the tablet sharder, so it will quietly misbehave in rare cases. An exception needs to be added to the current code (a sketch of such a check follows below), and a documentation entry needs to be provided about the limitation, if we want to ship.

And frankly I don't see how we can ship tablets as the default sharder without LWT :/
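
For reference, the explicit check asked for above could take roughly this shape (illustrative names; the real error type and message would need to be chosen):

#include <stdexcept>
#include <variant>

// After processing on the bounce target, refuse a second bounce with a
// clear, documentable error instead of letting std::bad_variant_access
// escape to the client.
struct response {};
struct bounce_to_shard { unsigned shard; };
using process_result = std::variant<response, bounce_to_shard>;

response unwrap_or_fail(process_result msg) {
    if (std::holds_alternative<bounce_to_shard>(msg)) {
        throw std::runtime_error(
            "tablet migrated during LWT shard bounce; retry the request");
    }
    return std::get<response>(std::move(msg));
}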

nyh commented 3 months ago

And frankly I don't see how we can ship tablets as the default sharder without LWT :/

I thought this was already decided (by @avikivity), and that this is why, starting with https://github.com/scylladb/scylladb/pull/17318, we warn on every keyspace creation that you may want to re-create the keyspace without tablets if you want to use CDC or LWT. I'm not happy about this decision either; it also means Alternator will not really work correctly with tablets, although it does enable tablets by default.

bhalevy commented 3 months ago

@tgrabiec is this issue a release blocker with https://github.com/scylladb/scylladb/pull/18026?

kbr-scylla commented 3 months ago

Given https://github.com/scylladb/scylladb/issues/18066, which will prevent us from entering this logic altogether, I think this can be moved to 6.1.

kbr-scylla commented 3 months ago

But https://github.com/scylladb/scylladb/issues/18066 should be a blocker.

tgrabiec commented 3 months ago

@tgrabiec is this issue a release blocker with #18026?

#18026 is orthogonal. If we don't support LWT, then #18026 doesn't need this issue.

tgrabiec commented 3 months ago

@tgrabiec the title has no mention of LWT;

The description mentions it though. We also have a label. Isn't that enough?

there is no test case;

We don't require adding test cases when opening issues. If we did, opening issues would be too hard. There's value in opening the issue, because we can at least track the problem and decide what to do with it, like we're doing now.