Crash Loop Back-Off when load-testing 2 replicas

vovayartsev commented 3 years ago

Environment: AWS's Kubernetes (via HELM chart), db.t3.large Postgres via RDS, 2GB RAM and 8vCPU per replica

Docker Image: rudderlabs/rudder-server:14102020.053158

Steps to reproduce:

1. Set up Rudderstack in Kubernetes via the official helm chart as a single-replica StatefulSet
2. Load-test it
3. Scale up to 2 replicas by updating backendReplicaCount: 2 in Helm chart configuration, as suggested here
4. Load-test again

Expected: doubled throughput

Actual: one of the replicas entered Crash Loop Back-Off with the following log message. The destination (Apache Kafka) continued handling the messages already in the queue.

Load-testing was done via Apache Benchmark from Kubernetes:

ab -p rudderstack-ab.json -A <writekey>: -n 1000 -c 30 http://rudderstack/v1/batch

see rudderstack-ab.json
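For anyone who wants to reproduce this without Apache Benchmark, here is a rough Python sketch of the same run. It mirrors the `ab` flags above: `-A <writekey>:` becomes HTTP Basic auth with the write key as the username and an empty password, `-n` becomes `total`, and `-c` becomes `concurrency`. The function names and the `application/json` content type are my own assumptions, not something taken from the `ab` invocation or from RudderStack docs.

```python
import base64
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_batch_request(url: str, write_key: str, payload: bytes) -> urllib.request.Request:
    """Build one POST request, authenticated the way `ab -A <writekey>:` does."""
    # Basic auth: write key as username, empty password (note the trailing colon).
    token = base64.b64encode(f"{write_key}:".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        # Assumption: the payload file is JSON, so send it as JSON.
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def send_once(url: str, write_key: str, payload: bytes, timeout: float = 10.0) -> str:
    """Send a single request and return the status code (or the error type) as a string."""
    request = build_batch_request(url, write_key, payload)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions by urllib.
        return str(err.code)
    except OSError as err:
        # Connection refused, DNS failure, timeout, and similar.
        return type(err).__name__


def run_load_test(url: str, write_key: str, payload: bytes,
                  total: int = 1000, concurrency: int = 30) -> Counter:
    """Send `total` requests with up to `concurrency` in flight; tally the outcomes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = pool.map(lambda _: send_once(url, write_key, payload), range(total))
        return Counter(outcomes)
```

To run it, read `rudderstack-ab.json` as bytes and call `run_load_test("http://rudderstack/v1/batch", write_key, payload)`. The returned counter makes it easy to see when a replica starts failing, since errors show up as non-200 keys.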

chandumlg commented 3 years ago

Thanks for reporting the issue.

At first sight, the issue doesn't seem to be related to scaling up. To investigate further, we need more information. Can you please post /data/rudderstack/error_store.json from the crashing pod?

Thank you Chandu.

vovayartsev commented 3 years ago

Reproduced and captured: https://gist.github.com/vovayartsev/ec2a6ff48a0ff328f5be3f4cfd0da127

but the error looks slightly different this time: https://gist.github.com/vovayartsev/c2f19da4de42bbaf8cd9ece7c9b4db5b

I can reproduce it easily with 2 replicas on a fresh db (it enters CrashLoopBackOff after ~200-300 batches), but it runs stably with 1 replica.

vovayartsev commented 3 years ago

Closing this issue as my setup was incorrect. I've been testing with RDS and both replicas were connected to the same RDS database.

As pointed out in Slack:

Yeah, that would not work. Both servers cannot write to the same set of tables. If you are using RDS, try connecting each server to a different database inside RDS.
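One possible way to follow that advice with a StatefulSet is to derive the database name from the pod's ordinal, since StatefulSet pods get stable names ending in `-0`, `-1`, and so on. This is only a sketch: the `jobsdb` prefix and the helper names are hypothetical, the databases have to be created in RDS beforehand, and I am assuming the server picks up its database name from the `JOBS_DB_DB_NAME` environment variable (please check this against your chart's values).

```python
import re


def replica_ordinal(pod_name: str) -> int:
    """Extract the StatefulSet ordinal from a pod name such as 'rudderstack-1'."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"pod name {pod_name!r} does not end in a StatefulSet ordinal")
    return int(match.group(1))


def db_name_for_pod(pod_name: str, prefix: str = "jobsdb") -> str:
    """Map each replica to its own database, e.g. 'rudderstack-1' -> 'jobsdb_1'.

    The prefix is a hypothetical naming scheme, not a RudderStack convention.
    """
    return f"{prefix}_{replica_ordinal(pod_name)}"
```

A small entrypoint wrapper could call `db_name_for_pod(socket.gethostname())` and export the result as `JOBS_DB_DB_NAME` (assumed variable name) before starting the server, so that replica 0 and replica 1 never share a set of tables.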

rudderlabs / rudder-server

Crash Loop Back-Off when load-testing 2 replicas #720