It turns out this was an issue on our end. We logged the duration of each DynamoDB query made by npdynamodb, and most complete in under 150ms. When running in our Kubernetes cluster, however, the duration spikes to over 5000ms (the default timeout) once every few hundred requests.
When we run it locally, against either a local DynamoDB instance or an AWS-hosted DynamoDB instance, we don't see these spikes in request time. So this is definitely an issue within our Kubernetes cluster.
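For anyone wanting to check for the same symptom, here is a rough sketch of the kind of per-query timing described above. It is not the code we actually ran and it does not use any npdynamodb API: `timed`, `summarize`, and the `log` array are hypothetical names, and the 5000ms threshold is taken from the default timeout mentioned above. The wrapper works with any function that returns a promise.

```javascript
// Hypothetical helper: runs any promise-returning function and records
// how long it took (in milliseconds) in the supplied log array.
async function timed(label, fn, log) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    // Recorded even if fn() rejects, so timed-out calls are still counted.
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    log.push({ label, ms });
  }
}

// Hypothetical helper: counts how many recorded calls reached the timeout
// threshold and reports the slowest duration seen.
function summarize(log, timeoutMs = 5000) {
  const slow = log.filter((entry) => entry.ms >= timeoutMs);
  const maxMs = log.reduce((max, entry) => Math.max(max, entry.ms), 0);
  return { total: log.length, slow: slow.length, maxMs };
}
```

You would wrap each query call, for example `await timed('createIndex', () => runQuery(), log)` where `runQuery` stands in for whatever call you are measuring, and then compare `summarize(log)` between a local run and a run inside the cluster.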
Here's the migration that timed out in under 5 minutes:
And the error while doing this migration:
It happens sporadically and isn't always reproducible, but if you run the migration enough times, it will be triggered. It seems to happen when ~creating tables and~ creating global secondary indexes.
This is a big issue for us because it makes the database migrations manual instead of automated.
Edit: After further testing, it seems to happen only when creating global secondary indexes, and it fails within a minute or two.