twitchscience / kinsumer

Native Go consumer for AWS Kinesis streams.

Non-stop ConditionalCheckFailedException error #30

Open jney opened 5 years ago

jney commented 5 years ago

Hello there, I'm now always getting the following error:

shard error (shardId-000000000005) in checkpointer.release: error releasing checkpoint: ConditionalCheckFailedException: The conditional request failed\n\tstatus code: 400, request id: RR9U1G5VFUCV6AHGT5DFJIPNJVVV4KQNSO5AEMVJF66Q9ASUAAJG

Any idea what's happening and how to fix it?

btw, I noticed ticket #27


Thank you

Sytten commented 5 years ago

I also get those errors when new consumers start

cb-anthony-merrill commented 5 years ago

Anyone had any luck on this? We are hitting this as well. Otherwise, we're starting our deep dive investigation.

garethlewin commented 5 years ago

I have seen this but have never been able to successfully reproduce it. If you find any strong reproducible steps that would really be helpful.

Typically we see it so rarely that just restarting the process gets past it, but I am very aware that this is not ideal.

cb-anthony-merrill commented 5 years ago

Hey @GarethLewin, a question for you about checkpoints.go: there is a comment in the checkpointRecord struct saying that OwnerID, LastUpdateRFC and FinishedRFC are not used for decision making, yet a few lines below it we are using OwnerID:

if record.OwnerID != nil && record.LastUpdate > cutoff {

Maybe this is related? Should it just be replaced with OwnerName, or does the original comment no longer apply?

garethlewin commented 5 years ago

I believe that is a mistake in checkpoints.go: OwnerID should be in the block of used variables and OwnerName should be in the "debug only" bit. I will verify this and update the code. From my quick check, ownerName is never checked but ownerID is.

ckatsaras-godaddy commented 1 year ago

Just checking in to see if a solution was ever found for this issue. I'm also experiencing this @jney, @Sytten, @garethlewin, @cb-anthony-merrill

garethlewin commented 1 year ago

No.

Still have never managed to get a reproducible situation, just a "sometimes it happens."

I've read through the code and debugged it multiple times and I cannot find a reason for this. I do not want to point at AWS/DDB, so my assumption is there is a bug in kinsumer; I just can't find it.

If you can reproduce or add any kind of diagnostic information beyond what has been shared, I'd love to fix this.

ckatsaras-godaddy commented 1 year ago

Ah, fair enough @garethlewin! Thanks for the quick response! In terms of reproduction, what I've noticed is that it happens whenever I boot up new containers in ECS and the older containers are attempting to release(). I wonder if the times I had configured for WithLeaderActionFrequency and WithShardCheckFrequency are part of the problem (both are set to 1 second, which may be far too fast).
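For reference, this is roughly how I have it configured (a minimal sketch; the kinsumer.New call is from memory, so the exact constructor signature may differ in your version):

```go
package example

import (
	"time"

	"github.com/twitchscience/kinsumer"
)

// newConsumer shows roughly how I have the frequencies configured.
// Both are set to 1s, which may be far too aggressive for shard handoff.
// The kinsumer.New call is from memory; check the signature in your version.
func newConsumer(streamName, applicationName, clientName string) (*kinsumer.Kinsumer, error) {
	cfg := kinsumer.NewConfig().
		WithShardCheckFrequency(1 * time.Second).
		WithLeaderActionFrequency(1 * time.Second)

	return kinsumer.New(streamName, applicationName, clientName, cfg)
}
```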

ckatsaras-godaddy commented 1 year ago

Also, I think this is an issue when you have more containers than shards for a stream (e.g. 3 containers consuming from 2 shards). In the case where all 3 containers deregister together, 1 of those containers will not be the owner of a shard (which is to be expected) and thus will fail the DynamoDB query in release(), since it was never the owner of any shard. Is this correct @garethlewin?

Sadly, I tested the theory above just now and it still showed up once when I brought up 2 new containers and deregistered 2 old ones. The odd thing is, Kinsumer still seems to consume properly from my stream, so I'm not actually sure if this is an issue I should be too worried about.

ckatsaras-godaddy commented 1 year ago

So after quite a bit of testing with ECS, I think I can provide a bit more context on what is happening @garethlewin.

My test includes 2 containers that are initially running in ECS and consuming from a stream with 2 shards. Given this configuration, the checkpoints table will have the following owner IDs:

Shard  | OwnerID        | ...
Shard1 | Container 1 ID | ...
Shard2 | Container 2 ID | ...

Whenever I deploy new containers to ECS, 2 completely new containers are brought up and, for a brief period, there are 4 containers running together. During this period, it appears that Container 3 and Container 4 are sometimes assigned ownership of the shards, which makes the checkpoints table look like this:

Shard  | OwnerID        | ...
Shard1 | Container 3 ID | ...
Shard2 | Container 4 ID | ...

So, when ECS scales down to the desired 2 containers and deregisters Container 1 and Container 2, they fail to remove their ownership from the checkpoints table because they are no longer the owners.

It's important to note that I can't seem to reproduce this locally by running multiple instances of my service, but I can easily reproduce it in ECS. Given this information, do you have any idea why Kinsumer would be promoting the two new containers to be the owners of the shards prematurely?

Any help is really appreciated 😃

ckatsaras-godaddy commented 1 year ago

Sorry for all the messages @garethlewin , but I think I have some additional information that may be useful as well.

I'm wondering if sorting the clients by OwnerID in the getClients function (https://github.com/twitchscience/kinsumer/blob/master/clients.go#L123; the sort is defined at https://github.com/twitchscience/kinsumer/blob/master/clients.go#L36-L38) is potentially problematic and aggravating this issue as well.

It looks like getClients is used within refreshShards, and the order in which these clients are returned has a direct impact on whether or not a client is "promoted": https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L137-L157

So, with that being said, if the list is returned in a "new" order, it potentially leads to new clients being promoted and older clients no longer having their ID associated with a checkpoint when they attempt to "release". It seems a bit odd to be sorting on a "random" UUID, since it will lead to different outcomes each time (if I'm understanding correctly).

Let me know what you think of this and if there's any other information I can provide 😄

garethlewin commented 1 year ago

No worries about the comments, but I no longer maintain this package, so I might not be super fast and responsive, but I'll do my best.

I haven't read (or written) this code in several years so I need to knock the rust off, but sorting by OwnerID is the primary idea that I built this entire library around.

By having all clients sort by the same field, we get a consistent understanding across all the clients of which shard gets allocated to which client. We all agree that shard 0 goes to the first client, etc. When new clients are added to the system, it is correct that a reallocation of shards will happen, but everyone is supposed to take a break and then move to the newly allocated shards.
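To make the idea concrete, the scheme is roughly this (an illustrative sketch of the consensus idea only; the real allocation lives in clients.go and kinsumer.go and splits the shards differently):

```go
package main

import (
	"fmt"
	"sort"
)

type client struct{ ID string }

// assignShards illustrates the consensus idea: every client sorts the same
// client list by the same key, so all clients derive the same ordering and
// agree on which shards map to which client without any extra coordination.
func assignShards(clients []client, shards []string, myID string) []string {
	sort.Slice(clients, func(i, j int) bool { return clients[i].ID < clients[j].ID })

	myIndex := -1
	for i, c := range clients {
		if c.ID == myID {
			myIndex = i
		}
	}
	if myIndex < 0 {
		return nil
	}

	// Here each client takes the shards whose index maps to its own index;
	// this modulo split is for illustration, not kinsumer's exact math.
	var mine []string
	for j, s := range shards {
		if j%len(clients) == myIndex {
			mine = append(mine, s)
		}
	}
	return mine
}

func main() {
	clients := []client{{"c2"}, {"c1"}, {"c3"}}
	shards := []string{"shardId-0", "shardId-1", "shardId-2", "shardId-3"}
	fmt.Println(assignShards(clients, shards, "c2")) // prints [shardId-1]
}
```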

Your ECS vs. local example is worrying to me. I wonder if there is a timestamp issue here. The older workers are supposed to give up consuming their shards, and it seems like you are finding that they are not giving them up.

Here is where everyone is supposed to check what shards they should consume:

https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L421

When we capture a shard, we are not supposed to take one until the existing reader has said they are done with it:

https://github.com/twitchscience/kinsumer/blob/master/shard_consumer.go#L75

I wonder if there is a race condition between releasing a shard and shutting down.
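To spell out the intended handshake (an illustrative sketch from memory, not the actual shard_consumer.go code; tryAcquire stands in for the conditional ownership write):

```go
package example

import (
	"context"
	"time"
)

// capture sketches the intended handoff: keep retrying until the previous
// owner has released the shard (or its record has aged out), and only then
// take ownership. In kinsumer the ownership write is a conditional DynamoDB
// update, so two clients can't both win; the open question is whether an old
// client can shut down before its release() runs.
func capture(ctx context.Context, shardID string, tryAcquire func(string) (bool, error)) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		acquired, err := tryAcquire(shardID)
		if err != nil {
			return err
		}
		if acquired {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Previous owner hasn't released yet; try again on the next tick.
		}
	}
}
```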

ckatsaras-godaddy commented 1 year ago

Thanks for all the details @garethlewin! Okay, so it seems like the sorting is expected here to "mix things up" and keep a healthy rotation. Yeah, it seems like there is some sort of issue where we are "prematurely" assigning new owners to shards before the older consumers can "release" their hold on them.

To be honest, I've done quite a bit of testing, and since the error happens on shutdown anyway, I simply call kinsumer.Stop() to ensure we clear the client table, etc. Ideally I'd love to have no errors on shutdown, but it doesn't seem to actually affect the functionality, as it's basically just saying:

"I couldn't find that I was the Owner of a checkpoint, so I can't clear my ownership"

Given that you no longer maintain the project, I really appreciate you going out of your way to help me 😄 !

garethlewin commented 1 year ago

No, the sorting isn't to mix things up. It's so that all the clients have consensus over who owns a shard, and who doesn't.

You should DEFINITELY call kinsumer.Stop() on shutdown or else your clients will hold onto the shards for longer than is needed.
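Roughly like this (a from-memory sketch, so double-check the method names against the current code; process is just a placeholder for whatever you do with each record):

```go
package example

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/twitchscience/kinsumer"
)

// process stands in for whatever you do with each record.
func process(data []byte) {}

// consume runs the consumer and makes sure Stop() is called on SIGTERM
// (what ECS sends during a deploy) so the shards get released promptly
// instead of waiting for the client record to age out.
func consume(k *kinsumer.Kinsumer) error {
	if err := k.Run(); err != nil {
		return err
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	go func() {
		<-sigs
		k.Stop()
	}()

	for {
		data, err := k.Next()
		if err != nil {
			return err
		}
		if data == nil {
			// From memory, Next returns nil data once Stop() has been called.
			return nil
		}
		process(data)
	}
}
```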

ckatsaras-godaddy commented 1 year ago

Oh, I see! Sorry for the confusion!

Just to clarify, I wonder if it's possible that release() is written to "fail" on any failure to update, when in reality there are scenarios where it isn't actually a failure.

Like in the scenario above, if there are 4 consumers and only 2 shards, we are guaranteed to have at least 2 consumers that don't own a checkpoint. If I'm understanding correctly, there's nothing stopping the 2 consumers that don't have ownership from exiting, and if so, failing to clear their ownership isn't really an error, since they don't have any ownership to clear.

I wonder if the improved logic for release() would be:

  1. Check if ownership exists
  2. Clear ownership IF applicable

What do you think of that @garethlewin ? 😄 https://github.com/twitchscience/kinsumer/blob/master/checkpoints.go#L203-L236
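To make that concrete, the check on the error path could look something like this (just a sketch; isNotOwner is a name I made up, and the real change would live inside release() in checkpoints.go):

```go
package example

import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// isNotOwner reports whether an UpdateItem error is just DynamoDB saying the
// conditional check failed, i.e. this client is not (or is no longer) the
// owner of the checkpoint. In that case there is nothing to release, and
// arguably it shouldn't be surfaced as a shard error.
func isNotOwner(err error) bool {
	var awsErr awserr.Error
	if errors.As(err, &awsErr) {
		return awsErr.Code() == dynamodb.ErrCodeConditionalCheckFailedException
	}
	return false
}
```

release() could then treat that specific condition as "nothing to clean up" instead of propagating it as a shard error, while still surfacing every other failure.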

garethlewin commented 1 year ago

I think you might be on to something; there is an error for "No shard assigned", but I am not sure what happens in that case if the consumer had a shard before.

ckatsaras-godaddy commented 1 year ago

That's good to hear!

For the "No shard assigned" case, I think this section here https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L205-L207 covers the scenario where we have more consumers than shards which is why we wouldn't hit that particular error.

I think I'll look into creating a simple check for ownership and putting it up in a PR for review. Once again, since you're no longer the maintainer of the project, don't feel you have to review, etc. 😄

As I was saying earlier, now that we know when the scenario occurs and how to safely clean up in it, this is mostly about removing the error so that others in the future aren't led in the wrong direction.