scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

LWT: when updating the same partition concurrently, some requests immediately return a timeout. #16261

Open Felix-zhoux opened 10 months ago

Felix-zhoux commented 10 months ago

When executing LWT writes to the same partition concurrently, some requests fail with a timeout error after only a short period, much shorter than the timeout set in the configuration file (Scylla version 5.1.15).

{"__type":"com.amazonaws.dynamodb.v20120810#InternalServerError","message" :"Internal Server Error: exceptions::mutation write timeout exception (Operation timed out for alternator multipart.multipart - received only 1 responses from 2 CL=SERIAL.)"}

Judging from the log printed by the client, the request only took 7ms on the server, which was far less than the 2s timeout set in the configuration file.

Info=AcquireCurlHandle:2023-11-30 07:05:19.273(0) Transmission: 2023-11-30 07:05:19.280(7)

From the monitoring we can see that the cluster is operating normally, with very low load and latency. So when using LWT to concurrently update the same partition (actually a PutItem with ReturnValues="ALL_OLD"), is it expected behavior that some requests immediately return a timeout?

avikivity commented 10 months ago

Can you create a reproducer?

Felix-zhoux commented 10 months ago

@avikivity Here are the steps to reproduce:

  1. Set up a cluster and enable Alternator (my environment is a 3-AZ cluster, but I don't think that matters)
  2. Execute the following script and specify multiple endpoints (multiple nodes are required)
$ cat test_lwt_write.py 
import boto3,logging
from boto3.session import Session
import os,time
import argparse

parser = argparse.ArgumentParser(description='Test concurrency lwt put.')
parser.add_argument('--dynamodbtable', type=str, help='The test table.', required=True)
parser.add_argument('--endpoint', type=str, nargs='+', help='The endpoints to connect, eg. "http://10.167.139.15:8000"', required=True)
parser.add_argument('--thread', type=int, default=10, help='The num of threads.')

args = parser.parse_args()
table_name = args.dynamodbtable
ents = args.endpoint
thd = args.thread

config = boto3.session.Config(connect_timeout=1000, read_timeout=30000, retries={'max_attempts': 0}, signature_version="s3v4")
session = Session("fake", "fake")
dynamodb_client_ft = session.client('dynamodb', endpoint_url=ents[0], config=config, region_name="us-west-2")
dynamodb_client_ft.create_table(
    AttributeDefinitions=[
        {
            'AttributeName': 'obj',
            'AttributeType': 'S'
        },
        {
            'AttributeName': 'bi',
            'AttributeType': 'S'
        },
    ],
    TableName=table_name,
    KeySchema=[
        {
            'AttributeName': 'obj',
            'KeyType': 'HASH'
        },
        {
            'AttributeName': 'bi',
            'KeyType': 'RANGE'
        },
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 123,
        'WriteCapacityUnits': 123
    },
)

from concurrent.futures import ThreadPoolExecutor

def task(t):
    start_time = time.time()
    try:
        dynamodb_client = session.client('dynamodb', endpoint_url=ents[t % len(ents)], config=config, region_name="us-west-2")
        put_response = dynamodb_client.put_item(
            TableName=table_name,
            Item={
                'obj':{'S':'o1'},
                'bi':{'S':'bi'+'%s' %(str(t))}
            },
            ReturnValues='ALL_OLD'
        )
        return put_response
    except Exception as e:
        # Exception objects have no .message attribute on Python 3; str(e)
        # works on both Python 2 and 3.
        msg = "ERROR: thread=%s cost=%s err=%s" % (t, (time.time() - start_time), str(e))
        return msg

executor = ThreadPoolExecutor(thd)

tasks = []
for t in range(1, thd):
    future = executor.submit(task, t)
    tasks.append(future)

for f in tasks:
    print(f.result())

The execution example is as follows:

$ python test_lwt_write.py --dynamodbtable test_lwt_write --endpoint "http://10.249.141.79:8000" "http://10.249.141.80:8000" "http://10.249.141.81:8000" "http://10.249.141.82:8000" "http://10.249.141.83:8000" "http://10.249.141.84:8000" --thread 30
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
ERROR: thread=14 cost=0.12740278244 err=An error occurred (InternalServerError) when calling the PutItem operation (reached max retries: 0): Internal server error: exceptions::mutation_write_timeout_exception (Operation timed out for alternator_test_lwt_write.test_lwt_write - received only 2 responses from 5 CL=SERIAL.)
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
ERROR: thread=20 cost=0.108790874481 err=An error occurred (InternalServerError) when calling the PutItem operation (reached max retries: 0): Internal server error: exceptions::mutation_write_timeout_exception (Operation timed out for alternator_test_lwt_write.test_lwt_write - received only 1 responses from 5 CL=SERIAL.)
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:37 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}
{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 05 Dec 2023 10:34:38 GMT', 'content-length': '2', 'content-type': 'application/x-amz-json-1.0', 'server': 'Seastar httpd'}}}

The more threads, the easier it is to reproduce. Note that the time taken by the failed request is much less than the configured timeout.

avikivity commented 10 months ago

@gleb-cloudius please try it out

gleb-cloudius commented 10 months ago

@gleb-cloudius please try it out

This is expected behaviour in the case of concurrent updates. Cassandra changed it a while ago to return a separate exception: https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

nyh commented 10 months ago

@gleb-cloudius please try it out

This is expected behaviour in the case of concurrent updates. Cassandra changed it a while ago to return a separate exception: https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Can you explain why it's expected behavior? We know the requests will be linearized and as a result can experience high latencies (see also https://github.com/scylladb/scylladb/issues/13078), but why would we get a timeout error before the timeout elapsed? Is there some hard-coded limit on the number of concurrent requests that may be queued?

nyh commented 10 months ago

By the way, @Felix-zhoux, in the DynamoDB API the server may return an InternalServerError, in which case the client SDK automatically retries the same request. So if I understand correctly, the fact that these errors happen should not cause any problems for your application, apart from higher latency because of the retry. Am I right?
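
For reference, this is roughly how the SDK-side retry behaviour can be tuned from boto3 (a minimal sketch; the exact retry modes and their semantics depend on the botocore version, so check the SDK documentation - the endpoint below is hypothetical):

import boto3
from botocore.config import Config

# 'standard' retry mode (botocore 1.13+); in this mode max_attempts is the
# total number of attempts including the first one - check the docs for the
# exact semantics in your SDK version.
retry_config = Config(retries={'max_attempts': 3, 'mode': 'standard'},
                      connect_timeout=5, read_timeout=30)

dynamodb = boto3.client('dynamodb',
                        endpoint_url='http://127.0.0.1:8000',  # any Alternator node
                        region_name='us-west-2',
                        aws_access_key_id='fake',
                        aws_secret_access_key='fake',
                        config=retry_config)

# A server-side InternalServerError is retried transparently up to the limit;
# only after that does the call raise to the application.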

gleb-cloudius commented 10 months ago

@gleb-cloudius please try it out

This is expected behaviour in the case of concurrent updates. Cassandra changed it a while ago to return a separate exception: https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Can you explain why it's expected behavior? We know the requests will be linearized and as a result can experience high latencies (see also #13078), but why would we get a timeout error before the timeout elapsed? Is there some hard-coded limit on the number of concurrent requests that may be queued?

There is an explanation in the issue. Sometimes Paxos simply does not know whether a write failed or succeeded, so it cannot return success, but there is nothing to wait for either. In this case "some error", which will probably cause the application to retry, is returned. Cassandra later added a special error type to convey what happened more precisely; we did not follow.

nyh commented 10 months ago

There is an explanation in the issue. Sometimes Paxos simply does not know whether a write failed or succeeded, so it cannot return success, but there is nothing to wait for either. In this case "some error", which will probably cause the application to retry, is returned. Cassandra later added a special error type to convey what happened more precisely; we did not follow.

I had a long chat about this with Gleb. It turns out that although the DynamoDB SDK's retry of the request seems to solve the problem of the spurious failures (and this is what I told @Felix-zhoux above), this retry - in fact, any automated retry by the SDK, not just in this case - introduces a linearizability error:

Consider we have a value a=0 and concurrently run two read-modify-write operations: an unconditional write a=1, and a conditional write a=2 if a==1.

If the application sees the second request succeeding, it can believe that from now on a=2 - because the request a=1 must have happened earlier (otherwise a=2 if a==1 would not have succeeded). However, with the SDK's retry we can end up with the second request succeeding but a=1, as the following sequence (and the sketch after it) shows:

  1. a=1 happens first, but Scylla gets confused because of the contention, and returns the fake timeout as an InternalServerError (other problems, like network failures, can cause a similar situation). The a=1 still happened!
  2. Now a=2 if a==1 runs and succeeds. a is set to 2.
  3. Now the SDK, which got the InternalServerError, retries the request a=1, setting a back to 1.
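
To make the sequence concrete, here is a rough sketch of the two operations as DynamoDB calls (the endpoint, table and attribute names are hypothetical, for illustration only):

import boto3

# Hypothetical table 'demo' with hash key 'p'; 'a' is the value discussed above.
dynamodb = boto3.client('dynamodb', endpoint_url='http://127.0.0.1:8000',
                        region_name='us-west-2',
                        aws_access_key_id='fake', aws_secret_access_key='fake')

# Request 1: a=1 (unconditional). If the server answers with an
# InternalServerError, the write may nevertheless have been applied.
dynamodb.put_item(TableName='demo',
                  Item={'p': {'S': 'key'}, 'a': {'N': '1'}})

# Request 2: a=2 if a==1 (conditional). Seeing it succeed tells the
# application that the a=1 write has already happened.
dynamodb.update_item(TableName='demo',
                     Key={'p': {'S': 'key'}},
                     UpdateExpression='SET a = :two',
                     ConditionExpression='a = :one',
                     ExpressionAttributeValues={':one': {'N': '1'},
                                                ':two': {'N': '2'}})

# If the SDK now transparently re-sends request 1, the item ends up with a=1
# even though the application already observed the conditional write succeed.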

Arguably, we can't fix this problem with LWT or DynamoDB's read-modify-write API, because it is not idempotent - the request a=1 may run more than once, not exactly once as linearizability wants to guarantee. The new DynamoDB "transactions" API solves the idempotency problem by including a "client token" in the request, but the regular RMW writes (or CQL's LWT) don't have this feature.

So I think that DynamoDB and Alternator users should be aware that requests are not idempotent and the SDK may cause them to happen more than once. I think this problem can affect DynamoDB just as it can affect us.
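
For completeness, a minimal sketch of the client-token idea from the DynamoDB transactions API mentioned above (purely illustrative; table and attribute names are made up, and whether a given DynamoDB-compatible server implements TransactWriteItems is a separate question):

import uuid
import boto3

dynamodb = boto3.client('dynamodb', region_name='us-west-2')

# The token is generated once per logical request and reused on every retry,
# so a retried TransactWriteItems call is treated as a duplicate, not re-applied.
token = str(uuid.uuid4())

dynamodb.transact_write_items(
    ClientRequestToken=token,
    TransactItems=[{
        'Update': {
            'TableName': 'demo',                      # hypothetical table
            'Key': {'p': {'S': 'key'}},
            'UpdateExpression': 'SET a = :two',
            'ConditionExpression': 'a = :one',
            'ExpressionAttributeValues': {':one': {'N': '1'},
                                          ':two': {'N': '2'}},
        },
    }],
)

The regular PutItem/UpdateItem calls (and CQL LWT) have no such token, which is why a transparent retry can re-apply them.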

Felix-zhoux commented 10 months ago

By the way, @Felix-zhoux, in the DynamoDB API the server may return an InternalServerError, in which case the client SDK automatically retries the same request. So if I understand correctly, the fact that these errors happen should not cause any problems for your application, apart from higher latency because of the retry. Am I right?

Thanks for the reminder, @nyh.
We will actively control the retry mechanism of the SDK. We need to be very cautious when retrying in a timeout scenario, because it may not solve the problem and may increase the system load.

I think that if this situation cannot be avoided, exceptions::mutation_write_timeout_exception is not a reasonable error message, right? After all, this request only took milliseconds to return.
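
For what it's worth, one cautious pattern (just a sketch, not something this thread prescribes) is to disable the SDK's automatic retries and, on an ambiguous error, read the item back before deciding whether to re-send the write. It reuses the table schema from the reproducer script above, and the endpoint is hypothetical:

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

cfg = Config(retries={'max_attempts': 0})  # no automatic retries
dynamodb = boto3.client('dynamodb', endpoint_url='http://127.0.0.1:8000',
                        region_name='us-west-2',
                        aws_access_key_id='fake', aws_secret_access_key='fake',
                        config=cfg)

def careful_put(item):
    try:
        return dynamodb.put_item(TableName='test_lwt_write', Item=item,
                                 ReturnValues='ALL_OLD')
    except ClientError:
        # Outcome unknown: check whether the write actually took effect.
        # (This simple equality check assumes the item consists only of its
        # key attributes, as in the reproducer above.)
        got = dynamodb.get_item(TableName='test_lwt_write',
                                Key={'obj': item['obj'], 'bi': item['bi']},
                                ConsistentRead=True)
        if got.get('Item') == item:
            return got          # the "failed" write was in fact applied
        raise                   # genuinely not applied; let the caller decide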

nyh commented 4 months ago

In a discussion with @gleb-cloudius he explained to me why this LWT "contention" problem, which causes quick timeouts, happens, and why it will be common especially (or even only) in Alternator:

In CQL, it turns out we modified the Scylla CQL driver to always send LWT writes to the "primary owner" of the partition, which acts as the coordinator. This coordinator can serialize all the LWT writes to the same partition (i.e., do them one by one). If we did not do that, the CQL driver would send LWT writes to the same partition to RF different coordinators at the same time, and several of those coordinators may try to start a ballot and notice a collision. When a collision is noticed, the different contenders retry the operation for up to cas_contention_timeout_in_ms (by default, 1000ms). If the retries keep failing until that 1-second timeout, the write fails with a "timeout", although the time that passed (1 second) is lower than the regular timeout (but it's not "immediate" as this issue claims).

To make things worse, in Alternator not only do we not send the LWT request to the one primary owner of the partition, we actually run it on any one of the N Scylla nodes, not just on the RF owners of the partition (the AWS SDK is completely not topology-aware, and doesn't send requests to an owner of the data, let alone the primary owner). This makes the "contention" very likely - two writes to the same partition are very likely to arrive at different nodes, not the same node.

Perhaps the solution for Alternator is to forward LWT writes to the one primary owner, similar to how the CQL driver does it, and similar to how we already forward LWT requests inside one node to the right shard. We actually have an issue about this already: https://github.com/scylladb/scylladb/issues/5703 !

See also: https://github.com/scylladb/scylladb/issues/13078
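
As a purely client-side illustration of the same idea (not the server-side fix of the issue above), an application could pin all writes for a given partition key to one Alternator endpoint, so concurrent writers that use the same scheme meet at a single coordinator. This is only a sketch - it knows nothing about Scylla's actual token ownership and does not handle endpoint failures; the endpoint addresses are hypothetical:

import hashlib
import boto3

# Hypothetical list of Alternator endpoints (one per node).
ENDPOINTS = ['http://10.0.0.1:8000', 'http://10.0.0.2:8000', 'http://10.0.0.3:8000']
_clients = {}

def client_for_partition(partition_key):
    # A stable hash maps each partition key to one fixed endpoint, so all
    # writers using this scheme send writes for that partition to the same node.
    idx = int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % len(ENDPOINTS)
    endpoint = ENDPOINTS[idx]
    if endpoint not in _clients:
        _clients[endpoint] = boto3.client('dynamodb', endpoint_url=endpoint,
                                          region_name='us-west-2',
                                          aws_access_key_id='fake',
                                          aws_secret_access_key='fake')
    return _clients[endpoint]

# Usage with the reproducer's schema: every item with partition key 'o1'
# goes through the same node.
client_for_partition('o1').put_item(
    TableName='test_lwt_write',
    Item={'obj': {'S': 'o1'}, 'bi': {'S': 'bi1'}},
    ReturnValues='ALL_OLD')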

nyh commented 4 months ago

When a collision is noticed, the different contenders retry the operation for up to cas_contention_timeout_in_ms (by default, 1000ms). If the retries keep failing until that 1-second timeout, the write fails with a "timeout", although the time that passed (1 second) is lower than the regular timeout (but it's not "immediate" as this issue claims).

By the way, there has to be a hole in this theory, because @Felix-zhoux in his original report saw the timeout just 7ms after the request, not a full second. Perhaps sometimes we forget to retry the contended LWT operation and return a timeout immediately? I just wrote a very simple reproducer and can also see these "timeouts", but I still need to figure out how long they take. The error I see:

E           botocore.errorfactory.InternalServerError: An error occurred (InternalServerError) when calling the 
UpdateItem operation (reached max retries: 0): Internal server error: exceptions::mutation_write_timeout_exception 
(Operation timed out for alternator_alternator_Test_1719146865383.alternator_Test_1719146865383 - received only
1 responses from 2 CL=LOCAL_SERIAL.)

gleb-cloudius commented 4 months ago

When a collision is noticed, the different contenders retry the operation for up to cas_contention_timeout_in_ms (by default, 1000ms). If the retries keep failing until that 1-second timeout, the write fails with a "timeout", although the time that passed (1 second) is lower than the regular timeout (but it's not "immediate" as this issue claims).

By the way, there has to be a hole in this theory, because @Felix-zhoux in his original report saw the timeout just 7ms after the request, not a full second.

If the timeout happens faster than any configured timeout, the only explanation is that this is not a real timeout but an "uncertainty" error, where the Paxos algorithm cannot figure out whether the current transaction succeeded or not. The question is whether this condition can happen due to contention.

nyh commented 4 months ago

The following test code fails on my laptop in about 1 in 5 runs, and demonstrates an update that fails after 18ms - not a one-second timeout - exactly as @Felix-zhoux suspected. The logs on the node that received the failed write also show the TRACE-level message

paxos - CAS[0] accept_proposal: proposal is partially rejected

which, as @gleb-cloudius explained to me, means that LWT reached an uncertainty situation where it immediately returns an error (and the error we return today is the regular timeout error).

The test is as follows (I need to clean it up before I propose it as a regression test). It starts a 3-node cluster, starts 3 client threads, and sends three RMW updates (just one from each thread) to different items in the same partition. Two of the updates are sent to node 0, one to node 1. Sometimes (around once in 5-10 runs), the update to node 1 fails with a "timeout", but the test prints that the time it took for this "timeout" is around 0.18 seconds - nothing close to a real timeout.


async def test_alternator_concurrent_rmw_same_partition(alternator3):
    manager, alternator, *_ = alternator3
    ips = [server.ip_addr for server in await manager.running_servers()]

    table = alternator.create_table(TableName=unique_table_name(),
        BillingMode='PAY_PER_REQUEST',
        KeySchema=[
            {'AttributeName': 'p', 'KeyType': 'HASH' },
            {'AttributeName': 'c', 'KeyType': 'RANGE' },
        ],
        AttributeDefinitions=[
            {'AttributeName': 'p', 'AttributeType': 'N' },
            {'AttributeName': 'c', 'AttributeType': 'N' },
        ])

    nthreads = 3
    # All threads write to one partition p, choosing among nitems different
    # items (nitems should be much larger than nthreads, to ensure there is
    # a low chance of contention for one specific item)
    p = 1
    nitems = nthreads * 1000
    def run_rmw(i):
        print(f"hello {i}")
        rand = random.Random()
        rand.seed(i)
        alternators = [get_alternator(ip) for ip in ips]
        for n in range(1):
            alternator_i = rand.randrange(len(alternators))
            alternator = alternators[alternator_i]
            tbl = alternator.Table(table.name)
            c = rand.randrange(nitems)
            print(f'thread {i} incrementing item {p},{c} on node {alternator_i}')
            start = time.time()
            try:
                tbl.update_item(Key={'p': p, 'c': c},
                    UpdateExpression='SET v = if_not_exists(v, :init) + :incr',
                    ExpressionAttributeValues={':init': 0, ':incr': 1})
            except ClientError:
                print(f"In incrementing {p},{c} on node {alternator_i}: error after {time.time()-start}")
                raise

    threads = [ThreadWrapper(target=run_rmw, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    try:
        for t in threads:
            t.join()
    finally:
        table.delete()

nyh commented 4 months ago

Created a reproducer test (in "topology" framework) for this issue in https://github.com/scylladb/scylladb/pull/19445.

Note that this is an Alternator test. The root of the problem is in LWT, not in Alternator, but it's more likely to happen in Alternator because in CQL the driver tries to send all the updates of the same partition to the same coordinator, while in Alternator the updates get sent to random coordinators, which causes this issue (if all the updates are sent to the same coordinator, this bug can't happen). The same problem can probably be reproduced in CQL LWT if prepared statements aren't used, because sending writes for the same partition to the same node only happens when prepared statements are used, so the driver knows that this is an LWT request and what its key is.
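
For illustration, roughly what that difference looks like with the Python CQL driver (keyspace, table and addresses are made up; recent driver versions use a token-aware load-balancing policy by default):

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
session = cluster.connect('ks')   # hypothetical keyspace with table tbl(p, c, v)

# Prepared statement: the driver can extract the partition key from the bound
# values (and the Scylla driver also knows it is an LWT statement), so its
# token-aware policy can keep routing writes for one partition to one node.
prepared = session.prepare('UPDATE tbl SET v = ? WHERE p = ? AND c = ? IF v = ?')
session.execute(prepared, (2, 1, 1, 1))

# Simple (unprepared) statement: the driver cannot tell which partition the
# write targets, so LWT writes to one partition may go to different coordinators.
session.execute("UPDATE tbl SET v = 2 WHERE p = 1 AND c = 1 IF v = 1")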

nyh commented 4 months ago

Created a reproducer test (in "topology" framework) for this issue in #19445.

Note that it's surprisingly easy for this bug (or is it "not a bug, it's a feature"?) to happen in a realistic scenario. When the test in my pull request is asked to send just three updates to different items in the same partition, each update to a different node, one of the updates fails about 10% of the time. When the test sends 450 updates, many of them always fail (the test stops after 3 failures, and almost always 3 do fail).

It can be argued that the driver will retry the failed request (boto3 defaults to three retries), but an application that relies heavily on updates to long partitions will probably see many of these failures and retries - and, after enough updates, even see the driver report a failure after several failed retries.

So maybe we should consider doing https://github.com/scylladb/scylladb/issues/5703 as a fix for this issue.