rudderlabs / rudder-server

Privacy and Security focused Segment-alternative, in Golang and React
https://www.rudderstack.com/

Rudderstack becomes extremely slow when you have one destination down #4953

Open fackyhigh opened 1 month ago

fackyhigh commented 1 month ago

Describe the bug
We have a data plane running in Kubernetes with 60 pods. When we see in Grafana that one of our webhook destinations is down, RudderStack becomes extremely slow: webhook delivery time increases dramatically, the rt table count grows from 2 to ~30 per PostgreSQL pod, and the webhook event sync lag goes from 10 seconds to almost one hour.

Steps to reproduce the bug

  1. Configure multiple webhooks as destinations for one source
  2. Create a load
  3. Fail one of the webhooks
  4. See how the rt table count, webhook delivery time, and event sync lag grow.

Expected behavior
When a destination is down, the system keeps running at normal speed and other destinations are not affected.

Screenshots

[two screenshots attached]

Any additional context
RudderStack version is 1.28.1.

Please tell us what we can tweak so that RudderStack keeps working as usual when one destination goes down. It would also be appreciated if you could share how the retry logic actually works and why it affects other destinations.

gitcommitshow commented 1 month ago

Can you share the response you see for the /health API call (run curl {DATA_PLANE_URL}/health, ideally while this issue is occurring; otherwise share whatever you see right now)?

gitcommitshow commented 1 month ago

@fackyhigh can you please also share the config value for Router.isolationMode and Router.WEBHOOK.isolationMode? This will help investigate the issue.

fackyhigh commented 1 month ago

Hello @gitcommitshow!

Here is the /health response:

{"appType":"EMBEDDED","server":"UP","db":"UP","acceptingEvents":"TRUE","routingEvents":"TRUE","mode":"NORMAL","backendConfigMode":"JSON","lastSync":"2024-08-12T21:17:29Z","lastRegulationSync":""}

BTW, I forgot to send you the config:

maxProcess: 12
gwDBRetention: 0h
routerDBRetention: 0h
enableProcessor: true
enableRouter: true
enableStats: true
statsTagsFormat: influxdb
Http:
  ReadTimeout: 0s
  ReadHeaderTimeout: 0s
  WriteTimeout: 10s
  IdleTimeout: 720s
  MaxHeaderBytes: 524288
RateLimit:
  eventLimit: 1000
  rateLimitWindow: 60m
  noOfBucketsInWindow: 12
Gateway:
  webPort: 8080
  maxUserWebRequestWorkerProcess: 128
  maxDBWriterProcess: 512
  CustomVal: GW
  maxUserRequestBatchSize: 128
  maxDBBatchSize: 128
  userWebRequestBatchTimeout: 15ms
  dbBatchWriteTimeout: 5ms
  maxReqSizeInKB: 4000
  enableRateLimit: false
  enableSuppressUserFeature: true
  allowPartialWriteWithErrors: true
  allowReqsWithoutUserIDAndAnonymousID: false
  webhook:
    batchTimeout: 20ms
    maxBatchSize: 32
    maxTransformerProcess: 64
    maxRetry: 5
    maxRetryTime: 10s
    sourceListForParsingParams:
      - shopify
EventSchemas:
  enableEventSchemasFeature: false
  syncInterval: 240s
  noOfWorkers: 128
Debugger:
  maxBatchSize: 32
  maxESQueueSize: 1024
  maxRetry: 3
  batchTimeout: 2s
  retrySleep: 100ms
SourceDebugger:
  disableEventUploads: false
DestinationDebugger:
  disableEventDeliveryStatusUploads: false
TransformationDebugger:
  disableTransformationStatusUploads: false
Archiver:
  backupRowsBatchSize: 100
JobsDB:
  jobDoneMigrateThres: 0.8
  jobStatusMigrateThres: 5
  maxDSSize: 100000
  maxMigrateOnce: 10
  maxMigrateDSProbe: 10
  maxTableSizeInMB: 300
  migrateDSLoopSleepDuration: 30s
  addNewDSLoopSleepDuration: 5s
  refreshDSListLoopSleepDuration: 5s
  backupCheckSleepDuration: 5s
  backupRowsBatchSize: 1000
  archivalTimeInDays: 3
  archiverTickerTime: 1440m
  backup:
    enabled: false
    gw:
      enabled: true
      pathPrefix: ""
    rt:
      enabled: true
      failedOnly: true
    batch_rt:
      enabled: false
      failedOnly: false
Router:
  jobQueryBatchSize: 10000
  updateStatusBatchSize: 1000
  readSleep: 1000ms
  fixedLoopSleep: 0ms
  noOfJobsPerChannel: 1000
  noOfJobsToBatchInAWorker: 20
  jobsBatchTimeout: 5s
  maxSleep: 60s
  minSleep: 0s
  maxStatusUpdateWait: 5s
  useTestSink: false
  guaranteeUserEventOrder: true
  kafkaWriteTimeout: 2s
  kafkaDialTimeout: 10s
  minRetryBackoff: 10s
  maxRetryBackoff: 300s
  noOfWorkers: 64
  allowAbortedUserJobsCountForProcessing: 1
  maxFailedCountForJob: 3
  retryTimeWindow: 180m
  failedKeysEnabled: false
  saveDestinationResponseOverride: false
  responseTransform: false
  MARKETO:
    noOfWorkers: 4
  throttler:
    MARKETO:
      limit: 45
      timeWindow: 20s
  BRAZE:
    forceHTTP1: true
    httpTimeout: 120s
    httpMaxIdleConnsPerHost: 32
BatchRouter:
  mainLoopSleep: 2s
  jobQueryBatchSize: 100000
  uploadFreq: 30s
  warehouseServiceMaxRetryTime: 3h
  noOfWorkers: 8
  maxFailedCountForJob: 128
  retryTimeWindow: 180m
  datePrefixOverride: "YYYY-MM-DD"
Warehouse:
  mode: embedded
  webPort: 8082
  uploadFreq: 1800s
  noOfWorkers: 8
  noOfSlaveWorkerRoutines: 4
  mainLoopSleep: 5s
  minRetryAttempts: 3
  retryTimeWindow: 180m
  minUploadBackoff: 60s
  maxUploadBackoff: 1800s
  warehouseSyncPreFetchCount: 10
  warehouseSyncFreqIgnore: false
  stagingFilesBatchSize: 960
  enableIDResolution: false
  populateHistoricIdentities: false
  redshift:
    maxParallelLoads: 3
    setVarCharMax: false
  snowflake:
    maxParallelLoads: 3
  bigquery:
    maxParallelLoads: 20
  postgres:
    maxParallelLoads: 3
  mssql:
    maxParallelLoads: 3
  azure_synapse:
    maxParallelLoads: 3
  clickhouse:
    maxParallelLoads: 3
    queryDebugLogs: false
    blockSize: 200000
    poolSize: 10
    disableNullable: true
    enableArraySupport: false
Processor:
  webPort: 8086
  loopSleep: 10ms
  maxLoopSleep: 5000ms
  fixedLoopSleep: 0ms
  maxLoopProcessEvents: 10000
  transformBatchSize: 100
  userTransformBatchSize: 200
  maxConcurrency: 200
  maxHTTPConnections: 100
  maxHTTPIdleConnections: 50
  maxRetry: 30
  retrySleep: 100ms
  timeoutDuration: 30s
  errReadLoopSleep: 30s
  errDBReadBatchSize: 1000
  noOfErrStashWorkers: 2
  maxFailedCountForErrJob: 3
  Stats:
    # have event name as label in prometheus metrics
    captureEventName: true
  dbReadBatchSize: 50000
  maxChanSize: 8192
Dedup:
  enableDedup: false
  dedupWindow: 900s
BackendConfig:
  configFromFile: false
  configJSONPath: /etc/rudderstack/workspaceconfig/workspaceConfig.json
  pollInterval: 5s
  regulationsPollInterval: 300s
  maxRegulationsPerRequest: 1000
recovery:
  enabled: true
  errorStorePath: /tmp/error_store.json
  storagePath: /tmp/recovery_data.json
  normal:
    crashThreshold: 5
    duration: 300s
Logger:
  enableConsole: true
  enableFile: false
  consoleJsonFormat: false
  fileJsonFormat: false
  logFileLocation: /tmp/rudder_log.log
  logFileSize: 100
  enableTimestamp: true
  enableFileNameInLog: true
  enableStackTrace: false
Diagnostics:
  enableDiagnostics: true
  gatewayTimePeriod: 60s
  routerTimePeriod: 60s
  batchRouterTimePeriod: 6l
  enableServerStartMetric: true
  enableConfigIdentifyMetric: true
  enableServerStartedMetric: true
  enableConfigProcessedMetric: true
  enableGatewayMetric: true
  enableRouterMetric: true
  enableBatchRouterMetric: true
  enableDestinationFailuresMetric: true
RuntimeStats:
  enabled: true
  statsCollectionInterval: 10
  enableCPUStats: true
  enableMemStats: true
  enableGCStats: true
PgNotifier:
  retriggerInterval: 2s
  retriggerCount: 500
  trackBatchInterval: 2s
fackyhigh commented 1 month ago

Regarding Router.isolationMode and Router.WEBHOOK.isolationMode: we don't have such settings. They also aren't present in the sample config in the rudder-server repo, nor here: https://www.rudderstack.com/docs/user-guides/administrators-guide/config-parameters/

gitcommitshow commented 3 weeks ago

The config seems correct. The backlog on the router tables is there because we retry for 3h when the service is unavailable. After 3h the events are aborted, which can be seen in the graph: after 3 hours the backlog started coming down.

For event sync lag time, @fackyhigh can you please check the sync lag per destID? The sync lag for destType=Webhook will be high because it is the average over all destinations of type webhook, and one of those destinations is slow.

tl;dr: Aggregate the sync lag time data by destID (as opposed to destType)

fackyhigh commented 3 weeks ago

Hello @gitcommitshow.

I'm not sure I've understood you. I have a couple of questions:

  1. What do you mean when you talk about the backlog?
  2. Can you explain in more detail how this retry logic works?
  3. Which graph were you referring to when you mentioned that the backlog started coming down?

Regarding "tl;dr: Aggregate the sync lag time data by destID (as opposed to destType)" — it's already aggregated by destType and destID:

[screenshot]

All graph lines look the same. Are there any other metrics?

I've experimented with delivery_latency and all graph lines are the same except one:

[screenshot]

To be clear, we have 7 destinations of type WEBHOOK, and those destinations point to only 3 services. 2 of the 3 services expose 3 API endpoints each (for Android, iOS, and Web events). So that's 2 services with 3 endpoints each, plus 1 service with a single endpoint.

fackyhigh commented 3 weeks ago

I'm starting to think the destinations may actually be fine and not down at all, because I've played with rate(sum(router_response_counts{respStatusCode!~"2[0-9]."}) by (destination,destType,respStatusCode)[1m]) and the max value is 0.8 errors per minute:

[screenshot]

What else besides errors can lead to WEBHOOK (rt) table growth and events being put into a pending state? Are there any additional metrics that can help us understand the root cause?

gitcommitshow commented 3 weeks ago

What do you mean when you talk about the backlog?

The incoming events are queued and removed from the queue when they are delivered to the destination. The pending events that still need to be delivered are the backlog.

Which graph were you referring to when you mentioned that the backlog started coming down?

Both graphs/screenshots you shared in the original post.

fackyhigh commented 2 weeks ago

@gitcommitshow, hello.

Is there any new info about the problem? I've provided the information you requested.

Also, you haven't answered these questions:

Can you explain in more detail how this retry logic works?

What else besides errors can lead to WEBHOOK (rt) table growth and events being put into a pending state?

Are there any additional metrics that can help us understand the root cause?

gitcommitshow commented 2 weeks ago

Allow me some more time to get back to you on the pending questions.

gitcommitshow commented 2 weeks ago

I received some input from the team, as follows.

Can you explain in more detail how this retry logic works?

All 429, 408, and 5xx status codes are retried; all other jobs are aborted. 5xx errors are retried for 180 minutes and up to 3 attempts (as per the config shared), and events are aborted after that. 429 and 408 are retried until a terminal status code is returned.
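
For reference, this behavior maps to the following settings in the config shared above (a minimal excerpt; the backoff comments are an inference from the setting names, not something stated here):

Router:
  maxFailedCountForJob: 3   # the "3 attempts" mentioned above
  retryTimeWindow: 180m     # the "180 mins" retry window mentioned above
  minRetryBackoff: 10s      # presumably the lower bound on the backoff between attempts
  maxRetryBackoff: 300s     # presumably the upper bound on the backoff between attempts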

What else besides errors can lead to WEBHOOK (rt) table growth and events being put into a pending state?

The webhook destination might be returning 429, 408, or 5xx status codes. Since these status codes are retried, new events are not processed in the meantime (to maintain event ordering), which results in a waiting state.

Are there any additional metrics that can help us understand the root cause?

We can check router_response_counts and see what the distribution of errors looks like at the time the backlog increased.

We can't do much if a destination is slow, as we have to keep storing events until the destination is up again so that we can deliver them.

fackyhigh commented 2 weeks ago

@gitcommitshow But we have had almost none of the errors you mentioned (429s, 408s, and 5xx) over the last 30 days:

[screenshot]

And we still see RT table growth every day:

[screenshot]

And the strange thing is that we have 80 rudder-server pods, but we see table growth only on specific pods:

[screenshot]

We always see this RT growth problem on only a small subset of the whole pod pool.

fackyhigh commented 1 week ago

Hey there @gitcommitshow.

Are there any new hypotheses?

I have a few more questions:

  1. Are there any other reasons for RT table growth besides 429, 408, or 5xx errors?
  2. Could one of the destinations simply be slow? For example, one destination is slow but returns no errors, and other events wait to maintain event ordering.
  3. Which settings can be tweaked to change the retry logic so failed events are aborted faster?
  4. How can we tell whether we have retries when we see no errors? Maybe your team has an SQL script that can scan all the tables and print the destinations with the most retries or the longest event delivery?
  5. If we change guaranteeUserEventOrder to false, will it speed up the process?

gitcommitshow commented 1 week ago

Hi @fackyhigh, thank you for sharing specific questions. I got some more input from the team, as follows.

Can you please share the last statuses per job in the first rt_job_status_* table? This can be done via select * from v_last_rt_job_status_<first>.

Which settings can be tweaked to change the retry logic so failed events are aborted faster?

This doesn't matter, since you said there are no errors in the last 30 days.

You have 80 rudder-server pods, but you still see table growth only on specific pods.

It's possible there is a heavy skew in the userIds of your events, which could also lead to a slowdown, as you noted in another point about event ordering.

If we change guaranteeUserEventOrder to false, will it speed up the process?

It could, but only if there are a lot of events from a few userIds.

You could also try increasing the router workers (more workers pushing out events concurrently): increase Router.<destType>.noOfWorkers for a specific destType, or Router.noOfWorkers for all destinations. The default is 64, so increase this number step by step.

Similarly, set Router.<destType>.guaranteeUserEventOrder for a specific destType, or Router.guaranteeUserEventOrder if event ordering is not needed anywhere.
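
For example, a minimal sketch of how these overrides could look in the config shared earlier, following the same per-destType pattern already used there for MARKETO and BRAZE (WEBHOOK as the destType and 128 workers are illustrative assumptions, not recommended values):

Router:
  noOfWorkers: 64                    # global default, as in the shared config
  WEBHOOK:                           # per-destType override block (destType assumed for this thread)
    noOfWorkers: 128                 # illustrative step up from the default 64; increase gradually
    guaranteeUserEventOrder: false   # only if ordering is not required for these webhook destinations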

fackyhigh commented 1 week ago

It's evening now, so the traffic spike has subsided and the RT table count has gone down as well. For now we have only executing and succeeded statuses, but we'll see what happens tomorrow.

Can you clarify just one thing? What is the userId? Is it part of the event? Is it something we can control or manage?

I'll experiment with the rest of your hints and get back to you. Have a great day, and thank you for all the useful information!

gitcommitshow commented 1 week ago

Can you clarify just one thing? What is the userId? Is it part of the event? Is it something we can control or manage?

Yes, it is part of the event data. To be more specific, it is either the userId or the anonymousId (when userId is not available). One of these (userId or anonymousId) is mandatory to identify the user.

Reference - https://www.rudderstack.com/docs/event-spec/standard-events/identify/#user-id-vs-anonymous-id