scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Consistency issues #3557

Open jbdalido opened 6 years ago

jbdalido commented 6 years ago

Hey everyone, we've upgraded a dual-DC cluster to 2.1.5 from 2.0.4 and we have huge consistency issues. We already had the issue with 2.0.4 and we tried upgrading to mitigate, without success.

It's critical as it's a full production cluster

Scenario:

1 - full upgrade + repair -pr on each node in every cluster
2 - selecting a row at LOCAL_QUORUM on DC1: row is present
3 - selecting the same row at LOCAL_QUORUM on DC2: row is not present
4 - selecting the same row at QUORUM: row is present
5 - re-running repair -pr on every node
6 - returning to 1 with another row: same result
7 - scripting a full crawl to read every row at ALL to have the data repaired
8 - counting the rows at LOCAL_QUORUM on each DC: 30k differences between the two, with rows still missing when checked manually
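For illustration, a per-row check like steps 2-4 could be scripted against each DC with the Cassandra/Scylla C++ driver (cassandra.h), roughly as in the sketch below; the contact point, keyspace, table and key column are placeholder names, not taken from this report:

// Sketch only: check whether one row is returned at a given consistency level.
// Keyspace, table, key column and contact point are hypothetical placeholders.
#include <cassandra.h>
#include <cstdio>

static bool row_present(CassSession* session, const char* id, CassConsistency cl) {
    CassStatement* stmt = cass_statement_new(
        "SELECT id FROM my_keyspace.my_table WHERE id = ?", 1);
    cass_statement_set_consistency(stmt, cl);      // e.g. LOCAL_QUORUM vs QUORUM
    cass_statement_bind_string(stmt, 0, id);
    CassFuture* fut = cass_session_execute(session, stmt);
    bool present = false;
    if (cass_future_error_code(fut) == CASS_OK) {
        const CassResult* res = cass_future_get_result(fut);
        present = cass_result_row_count(res) > 0;
        cass_result_free(res);
    }
    cass_future_free(fut);
    cass_statement_free(stmt);
    return present;
}

int main() {
    CassCluster* cluster = cass_cluster_new();
    cass_cluster_set_contact_points(cluster, "dc2-node1");   // point at the DC under test
    CassSession* session = cass_session_new();
    CassFuture* connect = cass_session_connect(session, cluster);
    if (cass_future_error_code(connect) == CASS_OK) {
        const char* id = "some-uuid";                         // placeholder key value
        std::printf("LOCAL_QUORUM: %d  QUORUM: %d\n",
                    row_present(session, id, CASS_CONSISTENCY_LOCAL_QUORUM),
                    row_present(session, id, CASS_CONSISTENCY_QUORUM));
    }
    cass_future_free(connect);
    cass_session_free(session);
    cass_cluster_free(cluster);
    return 0;
}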

Detail of the script:

Checking every row, and for each uuid checking that:

The result is that 90k rows on cluster B do not have their counterpart, and only 58k on the production DC. Interestingly, the data is supposedly missing from the cluster we are writing to (we use DC-2 for LOCAL_QUORUM writes and reads), and we have more data in DC-1.

I've talked with @tgrabiec on Slack and we tried clearing the cache of the replicas for a missing row. After clearing the cache, the data was back.

We've mitigated the issue for now by switching every query made at LOCAL_* to QUORUM to ensure we get the data, but we're not even sure it's working. The clusters are not overloaded; we have no issues whatsoever with timeouts or unavailable reads. The only problem is that as soon as the cache builds up (a few days), every single node outputs the same error:

batchlog_manager - Exception in batch replay: exceptions::read_timeout_exception (Operation timed out - received only 0 responses.)

When this issue arises, our max reactor load per node is at 100%; once we restart the entire cluster, the max drops to a steady 20-25 until the cache builds up, and then the same issue occurs again. We have had this since 1.7.

Details

Installation details
Scylla version (or git commit hash): 2.1.5 (upgraded from 2.0.4)
Cluster size: 14
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): gcloud / 10 CPU, 70 GB RAM
Disks (SSD/HDD, count): 2 local SSD (1 commitlog, 1 data)

tgrabiec commented 6 years ago

As for the secondary problem you mention, the batch replay timeout: it's most likely due to load on the reactor. System queries will not time out starting from 2.2, so that error should be gone. There is still the question of why your CPUs are overloaded; for that, we would have to get your Prometheus metrics, and perhaps also a flame graph of the overloaded CPUs.

avikivity commented 6 years ago

What compaction strategy do you use, for the table(s) that experience the issue?

tgrabiec commented 6 years ago

They use SizeTiered.

tgrabiec commented 6 years ago

It seems to be a case of rows resurrected after deletion.

I performed experiments with @jbdalido, and here's what we figured out. For one particular row, we found using cqlsh queries that some replicas have it (replica A) and some don't (replica B). We also found that clearing the cache on all replicas does not bring the row back on the replicas which don't have it. We determined that querying with CL=ALL returns the row, but doesn't heal replica B, which still returns no data when queried with CL=ONE. Using CQL tracing, we determined that querying with CL=ALL seems to recognize the inconsistency and sends repairing mutations to replica B. Using trace-level database logging, we determined that the mutation is applied on replica B and contains the missing rows.

Using trace-level cache logging during the query, we determined that the cache contains the rows in question, but it also has tombstones which cover them. The tombstones are from April, so they are long past the GC period by now and hence expired. This explains why replica B responds with no data: the rows have been deleted. It is evident that replica A contains the original inserts but doesn't know about the deletion. Replica A will not learn about the deletion now, because expired tombstones are compacted away during the mutation query.

One way we could end up in this situation is if replication of the deletion to replica A failed in the past, and nodetool repair was not executed within the GC grace period (10 days) as per the manual (http://docs.scylladb.com/nodetool-commands/repair/). After this period passes, read-repair will no longer replicate the tombstone. It may also be dropped at any time by compaction on the replicas which still have it. Refs #3560.
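To make the timing rule concrete, here is a minimal sketch (not Scylla's actual code) of the purgeability condition being described: a tombstone whose deletion time is older than now minus gc_grace_seconds is expired, so compaction may drop it and read-repair will no longer propagate it.

#include <chrono>

struct tombstone {
    std::chrono::system_clock::time_point deletion_time;
};

// Expired ("purgeable") once older than gc_grace (default 864000 s = 10 days):
// compaction may drop it, and read-repair will no longer propagate it.
bool is_purgeable(const tombstone& t,
                  std::chrono::system_clock::time_point now,
                  std::chrono::seconds gc_grace) {
    return t.deletion_time + gc_grace < now;
}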

nodetool repair -pr may also fail to replicate the expired tombstone, in case the primary replica is not replica A, it no longer has relevant data for the expired tombstone, and the repairs received from replica B get compacted into nothing before they are sent out to replica A, which is missing the deletion. The chances of propagating expired tombstones can be increased by running a naked nodetool repair, without -pr, on all nodes.

One thing which needs additional explanation is why, for some rows, we saw the data come back on some replicas after the cache was dropped. In this case, expired tombstones could have already been dropped from sstables by sstable compaction while the cache still contained them. The cache currently doesn't compact itself; it does so only indirectly, during eviction and repopulation from sstables. As long as the cache contains the expired tombstone, it will keep dropping resurrecting repairs from replica A. When the cache is dropped and repopulated from sstables, the expired tombstone will no longer be there, and repairs from replica A will resurrect the row.

glommer commented 6 years ago

If we physically move the sstables and do nodetool refresh, they should reappear, no?

tgrabiec commented 6 years ago

@glommer If we wanted to replicate expired tombstones which haven't been compacted away in sstables yet, that would be an option. But that should already have been done by nodetool repair, which bypasses the cache and does not compact anything.

glommer commented 6 years ago

But as you said yourself, it wouldn't replicate tombstones that have already expired. Moving the data physically would bypass this.

tgrabiec commented 6 years ago

Reads done on behalf of read-repair compact, but looking at the code of nodetool repair, it uses streaming, which bypasses the cache and does not compact. So nodetool repair should send expired tombstones if they're present in sstables.

In our case, we saw that replica B had the tombstone in sstables, because the cache was populated with the tombstone after being cleared. So an earlier repair should have propagated that tombstone to node A.

tgrabiec commented 6 years ago

I think the problem is with the -pr switch to nodetool repair, which is supposed to repair only the primary range of the current node.

It calls abstract_replication_strategy::get_primary_ranges() to calculate the ranges:

dht::token_range_vector
abstract_replication_strategy::get_primary_ranges(inet_address ep) {
    dht::token_range_vector ret;
    auto prev_tok = _token_metadata.sorted_tokens().back();
    for (auto tok : _token_metadata.sorted_tokens()) {
        auto&& eps = calculate_natural_endpoints(tok, _token_metadata);
        if (eps.size() > 0 && eps[0] == ep) {
            insert_token_range_to_sorted_container_while_unwrapping(prev_tok, tok, ret);
        }
        prev_tok = tok;
    }
    return ret;
}

It will select all ranges which have the current node as the first node in the list returned by calculate_natural_endpoints(). Looking at the implementation of network_topology_strategy::calculate_natural_endpoints(), it generates the list by walking the ring, which contains nodes from both DCs. This means that for any given DC, there will be some ranges which aren't primary for any node in that DC, which is clearly wrong, and will cause some ranges to not be repaired.

I asked @jbdalido to run regular nodetool repair on node A, and resurrected data was dropped from it after that, which confirms that there is a problem with -pr.

It's likely that the inconsistency in tombstones arose due to this. See #3560.

\cc @nyh

nyh commented 6 years ago

@tgrabiec very good catch!

Is it really wrong that every range will be primary on just one node on the entire cluster (across all DCs)? Wouldn't a repair of this node involve the entire cluster, not just the DC, unless the user manually restricted it just to this DC?

Cassandra's nodetool documentation contains the following snippet: (https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html)

To restrict the repair to the local datacenter, use the -dc option followed by the name of the datacenter. Issue the command from a node in the datacenter you want to repair. Issuing the command from a datacenter other than the named one returns an error. Do not use -pr with this option to repair only a local data center.

So apparently, if you use "-pr", you shouldn't restrict the repair to only one DC. Though indeed, it would be better to print an error when this combination is used, or better yet, change the meaning of "-pr" when "-dc" is given to find a primary owner in this DC for every range.
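As an illustration of that idea, a per-DC variant of the get_primary_ranges() code quoted above might look roughly like the sketch below. This is not existing Scylla code: is_in_local_dc() is a hypothetical helper standing in for a snitch/topology lookup, and the real get_primary_ranges_within_dc() referenced later in this thread may be structured differently.

dht::token_range_vector
abstract_replication_strategy::get_primary_ranges_within_dc(inet_address ep) {
    // Sketch only: a range counts as primary for ep if ep is the first natural
    // endpoint *from ep's own DC*, so every range gets exactly one primary
    // owner per DC instead of one per cluster.
    dht::token_range_vector ret;
    auto prev_tok = _token_metadata.sorted_tokens().back();
    for (auto tok : _token_metadata.sorted_tokens()) {
        auto&& eps = calculate_natural_endpoints(tok, _token_metadata);
        auto it = std::find_if(eps.begin(), eps.end(), [&] (auto&& e) {
            return is_in_local_dc(e);   // hypothetical DC-membership check
        });
        if (it != eps.end() && *it == ep) {
            insert_token_range_to_sorted_container_while_unwrapping(prev_tok, tok, ret);
        }
        prev_tok = tok;
    }
    return ret;
}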

nyh commented 6 years ago

Ok, I think I found the bug... We have in repair/repair.cc the following code:

    } else if (options.primary_range) {
        rlogger.info("primary-range repair");
        // when "primary_range" option is on, neither data_centers nor hosts
        // may be set, except data_centers may contain only local DC (-local)
#if 0
        if (options.data_centers.size() == 1 &&
                options.data_centers[0] == DatabaseDescriptor.getLocalDataCenter()) {
            ranges = get_primary_ranges_within_dc(db.local(), keyspace);
        } else
#endif
#if 0
        if (options.data_centers.size() > 0 || options.hosts.size() > 0) {
            throw std::runtime_error("You need to run primary range repair on all nodes in the cluster.");
        } else {
#endif
            ranges = get_primary_ranges(db.local(), keyspace);
#if 0
        }
#endif

The "#if 0" code is important, and missing! What the commented-out code means is that repair will check, and fail, if the data center or the host list is restricted together with "-pr"; however, it does allow restricting the repair to the local DC, but in that case it calls the special function get_primary_ranges_within_dc() instead of the regular get_primary_ranges() - that function should do exactly what @tgrabiec warned was not happening: assign a primary owner within the DC to each range.
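For reference, with the disabled branches restored, the logic would read roughly as in the sketch below. The original #if 0 code compares against DatabaseDescriptor.getLocalDataCenter(), which is the Cassandra-side name; local_dc_name() here is a hypothetical stand-in for the Scylla equivalent.

    } else if (options.primary_range) {
        rlogger.info("primary-range repair");
        // When "primary_range" is on, neither data_centers nor hosts may be
        // set, except that data_centers may contain only the local DC (-local).
        if (options.data_centers.size() == 1 &&
                options.data_centers[0] == local_dc_name()) {   // hypothetical helper
            // -pr restricted to the local DC: pick a primary owner for each
            // range *within this DC*, not a single global owner.
            ranges = get_primary_ranges_within_dc(db.local(), keyspace);
        } else if (options.data_centers.size() > 0 || options.hosts.size() > 0) {
            throw std::runtime_error("You need to run primary range repair on all nodes in the cluster.");
        } else {
            ranges = get_primary_ranges(db.local(), keyspace);
        }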

nyh commented 6 years ago

@jbdalido although I think what I found above (based on @tgrabiec's analysis) is definitely a bug, to confirm that this is the bug that hit you - did you restrict the repair to one data center, with "-dc" or "-local" options?

nyh commented 6 years ago

I just sent a patch titled "repair: fix combination of "-pr" and "-local" repair options" to the mailing list, which should fix this bug. @tgrabiec if you know how to reproduce this, you can test this patch. I'll also send in a minute a dtest patch which reproduces the bug as I understood it (nodetool repair with both "-pr" and "-local").

jbdalido commented 6 years ago

Hello @nyh, nope, we used -pr and never in combination with -local or -dc.

nyh commented 6 years ago

So, as much as it sounds like exactly the same problem, the interaction of -pr and -local fixed in commit 3194ce16b3f1541919e784b11ee01df995eddf00 cannot explain this issue. @jbdalido did you remember to run the "-pr" repair on all nodes in the cluster, not just all nodes of one data center? (Without "-local", you need to run "-pr" on each and every node in the entire cluster, or you'll miss repairing some of the ranges.) I'm trying to think of more bugs in "-pr"...

jbdalido commented 6 years ago

@nyh yes, -pr was used on every single node in the two data centers.

nyh commented 6 years ago

I'm trying to review the code in network_topology_strategy::calculate_natural_endpoints() which the "-pr" feature is using. I still haven't found another bug. @jbdalido did you by any chance use the feature of "racks", or just data centers?

tgrabiec commented 6 years ago

@nyh Could it be that it's because we don't call invalidate_cached_rings() when a node is added, and abstract_replication_strategy::get_natural_endpoints() misses newly added neighbors?

@jbdalido Could it be that there was at least one node, in any DC, which was not restarted since the last time you added a new node to any of the clusters?

edit: It's probably not it, because you wrote that you upgraded the whole cluster before repairing. edit 2: invalidate_cached_rings() is called indirectly when a node is added, so there is no problem with that.

jbdalido commented 6 years ago

@nyh here's an example of our rackdc.properties; we're just leveraging multi-DC, and the rack is the same for everyone:

dc=us-central1-b
rack=1

@tgrabiec yes confirmed all our nodes were upgraded.

tgrabiec commented 6 years ago

I have an alternative explanation for why repair -pr may not heal some of the resurrected rows after tombstones become expired, which doesn't involve bugs in repair. Suppose we have three replicas with the following contents:

Let's assume also that replica C is the primary replica for the containing partition range, so it will always be the repair coordinator if -pr is used.

All replicas will mismatch on checksum, so replica C will fetch mutations from replicas A and B, merge with its local state, and then send out its current state to replicas A and B.

It can happen that after replica C receives mutations from replica B, and maybe also from replica A, compaction is triggered before the send-out phase starts, and the incoming mutations are compacted into nothing. When replica C then gets to stream mutations to replicas A and B, it has no mutations to stream, so nothing changes on replica A.

If we run repair without -pr on replica A, then replica A will get repaired.
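To see how the merged state can compact into nothing, here is a small self-contained sketch (simplified types, not Scylla code) of the merge-then-compact step on the coordinator: replica A contributes only the old insert, replica B contributes only the expired tombstone, and after compaction with tombstone GC there is nothing left to stream back.

#include <chrono>
#include <optional>
#include <iostream>

using time_point = std::chrono::system_clock::time_point;

struct row       { time_point write_time; };
struct tombstone { time_point deletion_time; };

struct partition {
    std::optional<row>       data;       // live row, if any
    std::optional<tombstone> deletion;   // covering tombstone, if any
};

// Merge two replica views of the same partition (newest timestamp wins).
partition merge(partition a, partition b) {
    partition out = a;
    if (b.data && (!out.data || b.data->write_time > out.data->write_time)) {
        out.data = b.data;
    }
    if (b.deletion && (!out.deletion || b.deletion->deletion_time > out.deletion->deletion_time)) {
        out.deletion = b.deletion;
    }
    return out;
}

// Compaction with tombstone GC: drop data shadowed by the tombstone, then drop
// the tombstone itself once it is older than gc_grace.
partition compact(partition p, time_point now, std::chrono::seconds gc_grace) {
    if (p.deletion) {
        if (p.data && p.data->write_time < p.deletion->deletion_time) {
            p.data.reset();                      // deleted data goes away
        }
        if (p.deletion->deletion_time + gc_grace < now) {
            p.deletion.reset();                  // expired tombstone is GC'd
        }
    }
    return p;
}

int main() {
    using namespace std::chrono;
    auto now = system_clock::now();
    partition from_a{row{now - hours(24 * 90)}, std::nullopt};           // insert only
    partition from_b{std::nullopt, tombstone{now - hours(24 * 60)}};     // expired delete
    auto merged = compact(merge(from_a, from_b), now, seconds(864000));  // gc_grace = 10 days
    std::cout << "data: " << merged.data.has_value()
              << " tombstone: " << merged.deletion.has_value() << "\n";  // both 0: nothing to stream
}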

nyh commented 6 years ago

@tgrabiec there's something I don't understand in the scenario you're describing: we're assuming the repair happens within the GC period after the deletion, right? (Otherwise, we don't need elaborate scenarios to explain resurrections; we know they may happen.) In that case, the result of compacting together the insertion and deletion will not be "nothing" but rather just a deletion... And that should make all the other steps you described fine. Or am I missing something in the scenario you described?

tgrabiec commented 6 years ago

@nyh we know that in this case repair was not run once in every GC grace period. We only need to explain why repair doesn't help if we already have expired tombstones.

slivne commented 6 years ago

@tgrabiec - according to your explanation, can't this also happen when a regular nodetool repair is run (since compaction can happen before streaming back the data)?

nyh commented 6 years ago

@tgrabiec I don't know why, but I understood previously that repair was run regularly (just with "-pr"), more frequently than the GC grace period. If that is not the case, then this is the quintessential reason for resurrection of deleted data, and I don't think there's any question why data was resurrected :-(

Resurrection will happen not just with "-pr" repair - it can even happen without a repair at all: it's enough that you have two replicas, do an insert and a delete, but one of the replicas misses the delete. Some time later, compaction in the first replica leaves only the delete, with only the insert in the second replica. 10 days (the GC period) after the deletion, the deletion marker ("tombstone") is removed, and we are left with only an insertion in one of the replicas. At this point, the deleted data has already been resurrected (a read which happens to consult this replica will return it, and cause an automatic read-repair of this entry). A full repair (with or without -pr) will just make this more obvious.

I think your example above is still very interesting if you are curious how it can be that a repair without "-pr" didn't resurrect deleted data, but with "-pr" it did. I think it is OK in this case that the "-pr" repair didn't solve the problem (which was inevitable), but the fact that the non-"-pr" repair did solve it indicates that maybe we could have done the "-pr" repair differently and, by chance, also solved this case. What we could, and probably should (what do you think?), have done differently is to NOT do tombstone GC at all in the repair code's compaction. In your example, node C should not have compacted A and B's data to "nothing" but rather should have left the tombstone (which still exists - by chance - on node B).

Note that all of the above relies on chance. As soon as node B also compacts its sstables its older-than-GC tombstone will also be removed, and the data on node A will be resurrected, no matter what kind of repair you run or don't run afterwards.

tgrabiec commented 6 years ago

@slivne Yes, if you run it on replica C. But note that repair will be run on all replicas, so eventually on replica A as well, and then replica A will be repaired. With -pr, we will only ever repair on replica C.

tgrabiec commented 6 years ago

@nyh Yes, but we're not answering here the question of why the resurrection originally happened; we know it happened because of infrequent repair.

The theory that there is something wrong with -pr was raised based on the observation that even though some replicas still know about expired tombstones, repair with -pr doesn't propagate those tombstones to replicas which don't know about them, while repair without -pr does. See my earlier comment:

Reads done on behalf of read-repair compact, but looking at the code of nodetool repair, it uses streaming, which bypasses the cache and does not compact. So nodetool repair should send expired tombstones if they're present in sstables.

In our case, we saw that replica B had the tombstone in sstables, because the cache was populated with the tombstone after being cleared. So an earlier repair should have propagated that tombstone to node A.

The scenario I brought up is only meant to explain that phenomenon, which I think merits an explanation, so that we know whether it's expected behavior or a potential bug in repair.

I don't think we should try to make -pr deal with such cases of incorrect use. It's enough that we have regular repair, which when run on all nodes will propagate as much as possible.

nyh commented 6 years ago

But @tgrabiec as I explained in my long comment above, you can't count on this - if replica B did a compaction before the repair, it too would have lost the tombstone and then the repair - even without -pr - would not have un-resurrected the data.

tgrabiec commented 6 years ago

@nyh I'm not claiming we should count on this.

nyh commented 6 years ago

@tgrabiec thanks. By the way, since my comment above was long, I want to ask again the short question I had there: should we modify the repair code to do its compaction (when reading sstables on the repair slave and when merging mutations on the repair master) without GC? This will not make the resurrection problem disappear, but it will at least make it more consistent (so repairing on different nodes or in different orders won't make a noticeable difference like it did here, if I understood correctly what happened).

tgrabiec commented 6 years ago

@nyh Currently, reading done as part of repair (and streaming in general) doesn't compact. I think in general it should, for efficiency (see https://github.com/scylladb/scylla/issues/3561). When we change it to do the compaction, I think it'd be worthwhile to have a switch in repair which disables it, to facilitate recovery from cases like this.

The second idea, to disable GC around repair, could also be worthwhile for recovery. However, I am not sure it's worth it, given we have a workaround: run a naked repair on all nodes. Disabling GC while repairing would reduce efficiency, again due to the same reasoning as in https://github.com/scylladb/scylla/issues/3561.