jney opened this issue 5 years ago
I also get those errors when new consumers start
Anyone had any luck on this? We are hitting this as well. Otherwise, we're starting our deep dive investigation.
I have seen this but have never been able to successfully reproduce it. If you find any strong reproducible steps that would really be helpful.
Typically we see it so rarely that just restarting the process gets past it, but I am very aware that this is not ideal.
Hey @GarethLewin, question for you about `checkpoints.go`: there is a comment in the `checkpointRecord` struct saying that `OwnerID`, `LastUpdateRFC`, and `FinishedRFC` are not used for decision making, yet a few lines below it we are using one of them:
`if record.OwnerID != nil && record.LastUpdate > cutoff {`
Maybe this is related? Should it just be replaced with `OwnerName`, or does the original comment no longer apply?
I believe that is a mistake in `checkpoints.go`: `OwnerID` should be in the block of used variables and `OwnerName` should be in the "debug only" section. I will verify this and update the code. From my quick check, `OwnerName` is never checked but `OwnerID` is.
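For reference, here is a minimal sketch of the used-versus-debug-only split being described. This is modeled on the discussion above, not the exact kinsumer struct; the field grouping and `ownedAndFresh` helper are illustrative assumptions, while the condition itself mirrors the line quoted earlier.

```go
package main

import (
	"fmt"
	"time"
)

// checkpointRecord is a sketch modeled on the discussion above, not the
// exact kinsumer struct. Per the proposed fix, OwnerID and LastUpdate are
// used for decision making, while OwnerName and the RFC fields are debug-only.
type checkpointRecord struct {
	Shard      string
	LastUpdate int64   // used: last heartbeat, in nanoseconds
	OwnerID    *string // used: current owner's client ID

	// Debug only, not used for decision making:
	OwnerName     *string
	LastUpdateRFC string
	FinishedRFC   string
}

// ownedAndFresh mirrors the check quoted above:
// record.OwnerID != nil && record.LastUpdate > cutoff
func ownedAndFresh(r checkpointRecord, cutoff int64) bool {
	return r.OwnerID != nil && r.LastUpdate > cutoff
}

func main() {
	id := "client-1"
	now := time.Now().UnixNano()
	cutoff := now - int64(time.Minute)

	live := checkpointRecord{Shard: "shard-0", OwnerID: &id, LastUpdate: now}
	stale := checkpointRecord{Shard: "shard-1", OwnerID: &id, LastUpdate: cutoff - 1}
	unowned := checkpointRecord{Shard: "shard-2", LastUpdate: now}

	fmt.Println(ownedAndFresh(live, cutoff))    // true
	fmt.Println(ownedAndFresh(stale, cutoff))   // false
	fmt.Println(ownedAndFresh(unowned, cutoff)) // false
}
```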
Just checking in to see if a solution was ever found for this issue. I'm also experiencing this. @jney, @Sytten, @garethlewin, @cb-anthony-merrill
No.
Still have never managed to get a reproducible situation, just a "sometimes it happens."
I've read through the code and debugged it multiple times and I cannot find a reason for this. I do not want to point at AWS/DDB, so my assumption is there is a bug in kinsumer, I just can't find it.
If you can reproduce or add any kind of diagnostic information beyond what has been shared, I'd love to fix this.
Ah, fair enough @garethlewin! Thanks for the quick response! In terms of reproduction, what I've noticed is that it happens whenever I boot up new containers in ECS and the older containers attempt to release(). I wonder if the frequencies I had configured for WithLeaderActionFrequency and WithShardCheckFrequency are causing issues (both set to 1 second, which may be far too fast).
Also, I think this is an issue when you have more containers than shards for a stream (e.g. 3 containers consuming from 2 shards). In the case where all 3 containers deregister together, 1 of those containers will not be the owner of a shard (which is to be expected) and thus its DynamoDB query in release() will fail, since it never owned any shard. Is this correct @garethlewin?
Sadly, I tested the theory above just now and it still showed up once when I brought up 2 new containers and deregistered 2 old ones. The odd thing is, Kinsumer still seems to properly consume from my stream, so I'm not actually sure this is an issue I should be too worried about.
So after quite a bit of testing with ECS, I think I can provide a bit more context into what is happening @garethlewin .
My test includes 2 containers that are initially running in ECS and consuming from a stream with 2 shards. Given this configuration, the checkpoints table will have the following owner IDs:
| Shard | OwnerID | ... |
|---|---|---|
| Shard1 | Container 1 ID | ... |
| Shard2 | Container 2 ID | ... |
Whenever I deploy new containers to ECS, what happens is 2 completely new containers are deployed, and for a brief period of time there are 4 containers running together. During this period, it appears that Container 3 and Container 4 are sometimes assigned ownership of the shards, which makes the checkpoints table look like this:
| Shard | OwnerID | ... |
|---|---|---|
| Shard1 | Container 3 ID | ... |
| Shard2 | Container 4 ID | ... |
So, when ECS attempts to scale down to the desired 2 containers and deregisters Container 1 and Container 2, they fail to remove their ownership from the checkpoints table, as they are no longer the owners.
It's important to note that I can't seem to simulate this locally by running multiple instances of my service, but I can easily reproduce it in ECS. Given this information, do you have any idea why Kinsumer would be promoting the two new containers to be the owners of the shards prematurely?
Any help is really appreciated 😃
Sorry for all the messages @garethlewin , but I think I have some additional information that may be useful as well.
I'm wondering if sorting the clients by OwnerID in the getClients function (https://github.com/twitchscience/kinsumer/blob/master/clients.go#L123, https://github.com/twitchscience/kinsumer/blob/master/clients.go#L36-L38) is potentially problematic and aggravating this issue as well.
It looks like getClients is used within refreshShards, and it seems like the order in which these clients are returned has a direct impact on whether or not a client is "promoted": https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L137-L157
So, with that being said, if the list is returned in a "new" order, it potentially leads to new clients being promoted and older clients no longer having their ID associated with a checkpoint when they attempt to "release". It seems a bit odd to be sorting on a "random" UUID, since it will lead to different outcomes each time (if I'm understanding correctly).
Let me know what you think of this and if there's any other information I can provide 😄
No worries about the comments, but I no longer maintain this package so I might not be super fast and responsive but I'll do my best.
I haven't read (or written) this code in several years so I need to knock the rust off, but sorting by OwnerID is the primary idea that I built this entire library around.
By having all clients sort by the same field, we get a consistent understanding among all the clients of which shard gets allocated to which client: we all agree that shard 0 goes to the first client, and so on. When new clients are added to the system, it is correct that a reallocation of shards will happen, but everyone is supposed to take a break and then move to their newly allocated shards.
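That consensus idea can be sketched in a few lines. This is not kinsumer's actual allocation code; `assignShards` and its round-robin mapping are illustrative assumptions, but they show why every client independently reaches the same answer when all of them sort the client list by the same field.

```go
package main

import (
	"fmt"
	"sort"
)

// assignShards sketches the allocation idea described above: every client
// sorts the full client list by ID, finds its own index, and takes the
// shards whose index maps to it. Because all clients sort by the same
// field, they independently agree on who owns what. (Hypothetical helper,
// not kinsumer's actual algorithm.)
func assignShards(shards []string, clientIDs []string, me string) []string {
	sorted := append([]string(nil), clientIDs...)
	sort.Strings(sorted)
	idx := sort.SearchStrings(sorted, me)
	if idx >= len(sorted) || sorted[idx] != me {
		return nil // this client is not registered
	}
	var mine []string
	for i, s := range shards {
		if i%len(sorted) == idx {
			mine = append(mine, s)
		}
	}
	return mine
}

func main() {
	shards := []string{"shard-0", "shard-1"}
	clients := []string{"client-b", "client-a"}
	fmt.Println(assignShards(shards, clients, "client-a")) // [shard-0]
	fmt.Println(assignShards(shards, clients, "client-b")) // [shard-1]

	// With four clients and two shards, two clients correctly get nothing,
	// which is exactly the over-provisioned ECS scenario described earlier.
	four := []string{"client-a", "client-b", "client-c", "client-d"}
	fmt.Println(assignShards(shards, four, "client-c")) // []
}
```

Note that adding a client shifts the mapping for everyone, which is why a deploy that briefly doubles the client count reshuffles ownership.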
Your ECS vs local example is worrying to me. I wonder if there is a timestamp issue here. The older workers are supposed to give up consuming their shards, and it seems like you are finding that they are not giving them up.
Here is where everyone is supposed to check what shards they should consume:
https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L421
When we capture a shard, we are supposed to not take one until the existing reader has said they are done with it:
https://github.com/twitchscience/kinsumer/blob/master/shard_consumer.go#L75
I wonder if there is a race condition between releasing a shard and shutting down.
Thanks for all the details @garethlewin !
Okay, so it seems like sorting is to be expected here to "mix things up" and keep a healthy rotation. Yeah, it seems like there is some sort of issue where we are "prematurely" assigning new owners to shards before the older consumers can "release" their hold on them.
To be honest, I've done quite a bit of testing, and since the error happens on shutdown anyway, I simply call kinsumer.Stop() to ensure we clear the client table, etc. Ideally I'd love to have no errors on shutdown, but it doesn't seem to actually affect the functionality, as it's basically just saying:
"I couldn't find that I was the Owner of a checkpoint, so I can't clear my ownership"
Given that you no longer maintain the project, I really appreciate you going out of your way to help me 😄 !
No, the sorting isn't to mix things up. It's so that all the clients have consensus over who owns a shard, and who doesn't.
You should DEFINITELY call kinsumer.Stop() on shutdown or else your clients will hold onto the shards for longer than is needed.
Oh, I see! Sorry for the confusion!
Just to clarify, I wonder if it's possible that release() is written to "fail" on any failure to update, when in reality there are scenarios where it isn't actually a failure.
Like in the scenario above, if there are 4 consumers and only 2 shards, we are guaranteed to have at least 2 consumers that don't own a checkpoint. If I'm understanding correctly, there's nothing stopping the 2 consumers that don't have ownership from exiting, and if so, it shouldn't be an error when they fail to clear their ownership, since they never had any.
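That reasoning can be sketched as an explicit ownership check before clearing. This is a hypothetical illustration, not kinsumer's actual release() code: the `owners` map stands in for the checkpoints table, and the key point is that "not the owner" returns nil instead of an error.

```go
package main

import "fmt"

// owners is a toy stand-in for the checkpoints table, mapping shard -> owner.
var owners = map[string]string{"shard-0": "client-1"}

// releaseIfOwner sketches the proposed fix: a client that never owned the
// shard (the more-clients-than-shards case) exits cleanly instead of
// treating the failed conditional update as an error.
func releaseIfOwner(shard, myID string) error {
	owner, ok := owners[shard]
	if !ok || owner != myID {
		// Not the owner: nothing to release, and that's fine.
		return nil
	}
	delete(owners, shard) // stands in for the conditional DynamoDB update
	return nil
}

func main() {
	fmt.Println(releaseIfOwner("shard-0", "client-2")) // <nil>: never owned it
	fmt.Println(releaseIfOwner("shard-0", "client-1")) // <nil>: ownership cleared
	fmt.Println(owners["shard-0"])                     // "" (row is gone)
}
```

A real implementation would still need to surface genuine DynamoDB errors; the only behavior change sketched here is the non-owner path.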
I wonder if the improved logic for release() would be an ownership check along those lines. What do you think of that @garethlewin? 😄 https://github.com/twitchscience/kinsumer/blob/master/checkpoints.go#L203-L236
I think you might be on to something; there is an error for "No shard assigned", but I am not sure what happens in that case if the consumer had a shard before.
That's good to hear!
For the "No shard assigned" case, I think this section here https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L205-L207 covers the scenario where we have more consumers than shards, which is why we wouldn't hit that particular error.
I think I'll look into creating a simple check for ownership and putting it up in a PR for review. Once again, since you're no longer the maintainer of the project, don't feel you have to review, etc 😄
As I was saying earlier, now that we know when this scenario occurs and how to safely clean up, it's mostly about removing the error so that others in the future aren't led in the wrong direction.
Hello there, I'm now always getting the following error:
Any idea what's happened and how to fix that?
By the way, I noticed ticket #27.
Thank you