Open fackyhigh opened 1 month ago
Can you share what response do you see for /health
API call (run curl {DATA_PLANE_URL}/health
, ideally when this issue occur otherwise share whatever you see right now)?
@fackyhigh can you please also share the config value for Router.isolationMode
and Router.WEBHOOK.isolationMode
? This will help investigate the issue.
Hello @gitcommitshow!
Here is:
{"appType":"EMBEDDED","server":"UP","db":"UP","acceptingEvents":"TRUE","routingEvents":"TRUE","mode":"NORMAL","backendConfigMode":"JSON","lastSync":"2024-08-12T21:17:29Z","lastRegulationSync":""}
BTW I forgot to send you a config:
maxProcess: 12
gwDBRetention: 0h
routerDBRetention: 0h
enableProcessor: true
enableRouter: true
enableStats: true
statsTagsFormat: influxdb
Http:
ReadTimeout: 0s
ReadHeaderTimeout: 0s
WriteTimeout: 10s
IdleTimeout: 720s
MaxHeaderBytes: 524288
RateLimit:
eventLimit: 1000
rateLimitWindow: 60m
noOfBucketsInWindow: 12
Gateway:
webPort: 8080
maxUserWebRequestWorkerProcess: 128
maxDBWriterProcess: 512
CustomVal: GW
maxUserRequestBatchSize: 128
maxDBBatchSize: 128
userWebRequestBatchTimeout: 15ms
dbBatchWriteTimeout: 5ms
maxReqSizeInKB: 4000
enableRateLimit: false
enableSuppressUserFeature: true
allowPartialWriteWithErrors: true
allowReqsWithoutUserIDAndAnonymousID: false
webhook:
batchTimeout: 20ms
maxBatchSize: 32
maxTransformerProcess: 64
maxRetry: 5
maxRetryTime: 10s
sourceListForParsingParams:
- shopify
EventSchemas:
enableEventSchemasFeature: false
syncInterval: 240s
noOfWorkers: 128
Debugger:
maxBatchSize: 32
maxESQueueSize: 1024
maxRetry: 3
batchTimeout: 2s
retrySleep: 100ms
SourceDebugger:
disableEventUploads: false
DestinationDebugger:
disableEventDeliveryStatusUploads: false
TransformationDebugger:
disableTransformationStatusUploads: false
Archiver:
backupRowsBatchSize: 100
JobsDB:
jobDoneMigrateThres: 0.8
jobStatusMigrateThres: 5
maxDSSize: 100000
maxMigrateOnce: 10
maxMigrateDSProbe: 10
maxTableSizeInMB: 300
migrateDSLoopSleepDuration: 30s
addNewDSLoopSleepDuration: 5s
refreshDSListLoopSleepDuration: 5s
backupCheckSleepDuration: 5s
backupRowsBatchSize: 1000
archivalTimeInDays: 3
archiverTickerTime: 1440m
backup:
enabled: false
gw:
enabled: true
pathPrefix: ""
rt:
enabled: true
failedOnly: true
batch_rt:
enabled: false
failedOnly: false
Router:
jobQueryBatchSize: 10000
updateStatusBatchSize: 1000
readSleep: 1000ms
fixedLoopSleep: 0ms
noOfJobsPerChannel: 1000
noOfJobsToBatchInAWorker: 20
jobsBatchTimeout: 5s
maxSleep: 60s
minSleep: 0s
maxStatusUpdateWait: 5s
useTestSink: false
guaranteeUserEventOrder: true
kafkaWriteTimeout: 2s
kafkaDialTimeout: 10s
minRetryBackoff: 10s
maxRetryBackoff: 300s
noOfWorkers: 64
allowAbortedUserJobsCountForProcessing: 1
maxFailedCountForJob: 3
retryTimeWindow: 180m
failedKeysEnabled: false
saveDestinationResponseOverride: false
responseTransform: false
MARKETO:
noOfWorkers: 4
throttler:
MARKETO:
limit: 45
timeWindow: 20s
BRAZE:
forceHTTP1: true
httpTimeout: 120s
httpMaxIdleConnsPerHost: 32
BatchRouter:
mainLoopSleep: 2s
jobQueryBatchSize: 100000
uploadFreq: 30s
warehouseServiceMaxRetryTime: 3h
noOfWorkers: 8
maxFailedCountForJob: 128
retryTimeWindow: 180m
datePrefixOverride: "YYYY-MM-DD"
Warehouse:
mode: embedded
webPort: 8082
uploadFreq: 1800s
noOfWorkers: 8
noOfSlaveWorkerRoutines: 4
mainLoopSleep: 5s
minRetryAttempts: 3
retryTimeWindow: 180m
minUploadBackoff: 60s
maxUploadBackoff: 1800s
warehouseSyncPreFetchCount: 10
warehouseSyncFreqIgnore: false
stagingFilesBatchSize: 960
enableIDResolution: false
populateHistoricIdentities: false
redshift:
maxParallelLoads: 3
setVarCharMax: false
snowflake:
maxParallelLoads: 3
bigquery:
maxParallelLoads: 20
postgres:
maxParallelLoads: 3
mssql:
maxParallelLoads: 3
azure_synapse:
maxParallelLoads: 3
clickhouse:
maxParallelLoads: 3
queryDebugLogs: false
blockSize: 200000
poolSize: 10
disableNullable: true
enableArraySupport: false
Processor:
webPort: 8086
loopSleep: 10ms
maxLoopSleep: 5000ms
fixedLoopSleep: 0ms
maxLoopProcessEvents: 10000
transformBatchSize: 100
userTransformBatchSize: 200
maxConcurrency: 200
maxHTTPConnections: 100
maxHTTPIdleConnections: 50
maxRetry: 30
retrySleep: 100ms
timeoutDuration: 30s
errReadLoopSleep: 30s
errDBReadBatchSize: 1000
noOfErrStashWorkers: 2
maxFailedCountForErrJob: 3
Stats:
# have event name as label in prometheus metrics
captureEventName: true
dbReadBatchSize: 50000
maxChanSize: 8192
Dedup:
enableDedup: false
dedupWindow: 900s
BackendConfig:
configFromFile: false
configJSONPath: /etc/rudderstack/workspaceconfig/workspaceConfig.json
pollInterval: 5s
regulationsPollInterval: 300s
maxRegulationsPerRequest: 1000
recovery:
enabled: true
errorStorePath: /tmp/error_store.json
storagePath: /tmp/recovery_data.json
normal:
crashThreshold: 5
duration: 300s
Logger:
enableConsole: true
enableFile: false
consoleJsonFormat: false
fileJsonFormat: false
logFileLocation: /tmp/rudder_log.log
logFileSize: 100
enableTimestamp: true
enableFileNameInLog: true
enableStackTrace: false
Diagnostics:
enableDiagnostics: true
gatewayTimePeriod: 60s
routerTimePeriod: 60s
batchRouterTimePeriod: 6l
enableServerStartMetric: true
enableConfigIdentifyMetric: true
enableServerStartedMetric: true
enableConfigProcessedMetric: true
enableGatewayMetric: true
enableRouterMetric: true
enableBatchRouterMetric: true
enableDestinationFailuresMetric: true
RuntimeStats:
enabled: true
statsCollectionInterval: 10
enableCPUStats: true
enableMemStats: true
enableGCStats: true
PgNotifier:
retriggerInterval: 2s
retriggerCount: 500
trackBatchInterval: 2s
Regarding Router.isolationMode
and Router.WEBHOOK.isolationMode
we don't have such settings. As well they aren't present in the sample config inside rudder-server repo nor here: https://www.rudderstack.com/docs/user-guides/administrators-guide/config-parameters/
Config seems correct. Backlog on the router tables is because we retry for 3h when service is unavailable. After 3h events will be aborted as can be seen in the graph after 3 hours backlog started coming down.
For event sync lag time, @fackyhigh can you please check the sync lag for different destID
as sync lag for destType=Webhook
would be high because it will be avg of all the destinations with type webhook and one destination is slow.
tl;dr: Aggregate the sync lag time data by destID
(as opposed to destType
)
Hello @gitcommitshow.
I'm not sure that I've understood you. I have a couple questions:
tl;dr: Aggregate the sync lag time data by destID (as opposed to destType) — It's aggregated by destType and destID:
All graph lines are the same. Is there any other metrics?
I've experimented with delivery_latency
and all graph lines are the same except one:
To be clear we have 7 destination with a type of WEBHOOK. And those destinations are only 3 services. 2 out of 3 services has 3 API endpoint for Android, iOS and Web events. So 2 destination with 3 endpoints each and 1 destination with one endpoint.
I'm starting to think that destinations may be good and not being down at all. Because I've played with rate(sum(router_response_counts{respStatusCode!~"2[0-9]."}) by (destination,destType,respStatusCode)[1m])
and the max values is 0.8 error per minute:
What else except errors can lead to WEBHOOK(rt) tables growth and start putting events to pending state? Are there any additional metrics which can help us to understand the root cause?
What do you mean when you speaking about the backlog?
The incoming events are queued, and removed from the queue when they are delivered to the destination. The pending events that need to be delivered is the backlog.
About which graph are you speaking when you've mentioned that backlog started coming down?
Both the graph/screenshots you shared in the original post
@gitcommitshow, hello.
Is there any about the problem info? I've provided the information you requested.
As well you haven't answered these questions:
Can you tell me more deeply how this retry logic works?
What else except errors can lead to WEBHOOK(rt) tables growth and start putting events to pending state?
Are there any additional metrics which can help us to understand the root cause?
Allow me some more time to get back to you on the pending questions
Received some inputs from the team as following
Can you tell me more deeply how this retry logic works?
All the 429s, 408s and 5xx status codes are retried, rest all the jobs are aborted. 5xx errors are retried for 180 mins and 3 attempts(as per the config shared) and events will be aborted after that. 429 and 408 are retried until any terminal status code is returned.
What else except errors can lead to WEBHOOK(rt) tables growth and start putting events to pending state?
webhook destination might be returning 429s, 408s or 5xx status code and since these status codes are retried new events would not be processed(to maintain event ordering) resulting in waiting state.
Are there any additional metrics which can help us to understand the root cause?
we can check router_response_counts
and see what is the distribution of error at the time when backlog has increased.
We can't do much if destination is slow as we have to keep storing events until destination is up again so that we can deliver the events.
@gitcommitshow But we have almost no errors for the last 30 days which you have mentioned (429s, 408s and 5xx):
And still we have RT tables growth every day:
And the strange thing is that we have 80 rudder-server pods, but we see tables growth only on specific pods:
We always have this problem with RT growth only on small subset of the whole pod pool.
Hey there @gitcommitshow.
Are there any new hypotheses?
I have a couple of questions more:
1) Are there any other reasons for RT tables growth except 429s, 408s or 5xx errors?
2) Can be one of destinations be simply slow? For example, 1 destination is slow but without errors, and other events are waiting to maintain event order.
3) Which settings can be tweaked to change this retry logic to abort failed events faster?
4) How can we understand if we have retries when we have no errors? Maybe your team has an SQL script which can get all events from all tables and print destinations which have most retries OR longest event delivery?
5) If we change guaranteeUserEventOrder
to false
, will it speed up the process?
Hi @fackyhigh , thank you for sharing specific questions. I got some more inputs from the team as following
Can you please share the last statuses per job in the first rt_job_status_*
table - can be done via select * from v_last_rt_job_status_<first>
Which settings can be tweaked to change this retry logic to abort failed events faster?
doesn't matter since you said there's no errors in the last 30 days
you have 80 rudder-server pods, but you still see tables growth only on specific pods
it's possible there's a heavy skew in the userIDs of your events - which could also lead to some slowdown as you've noted in some other point talking about event ordering
change
guaranteeUserEventOrder
tofalse
, will it speed up the process?
it could, but only if there's a lot of events from a few userIDs
you could also try increasing the router workers(more workers pushing out events concurrently) - increase Router.<destType>.noOfWorkers
for specific destType
or Router.noOfWorkers
for all. Default is 64, so increase this number step by step.
also Router.<destType>.guaranteeUserEventOrder
for specific destType
or Router.guaranteeUserEventOrder
if event ordering's not needed anywhere
It's evening now, so a traffic spike goes down and RT tables count goes down as well. So, for now, we have only executing and succeeded statuses, but we'll see what happens tomorrow.
Can you clarify only one thing? What is the UserID? Is this a part of event or what? Is this something that we can handle or manage?
With rest of your hints I'll experiment and get back to you. Have a great day and thank you for the lots of useful information!!
Can you clarify only one thing? What is the UserID? Is this a part of event or what? Is this something that we can handle or manage?
Yes, it is part of the event data. To be more specific, it is either the userId
or anonymousId
(when userId
is not available). One of these (userId
or anonymousId
) are mandatory to identify user.
Reference - https://www.rudderstack.com/docs/event-spec/standard-events/identify/#user-id-vs-anonymous-id
Describe the bug We have data-plane running in k8s. There are 60 pods. When we see in Grafana that one of our Webhook destinations is down, then Rudderstack becomes extremely slow. Webhook delivery time increases dramatically, rt tables count increases from 2 to ~30 per PostgreSQL pod, webhook event sync lag time goes from 10 second to one hour almost.
Steps to reproduce the bug Enter the steps to reproduce the behavior.
Expected behavior When destination is down the system is still running fast and it doesn't affect other destinations.
Screenshots
Any additional context Rudderstack version is 1.28.1
Please, tell us what to tweak so Rudderstack could work as usual at the times when one destination may go down. As well It'l be appreciated if you share how the retry logic actually works and why it affects other destinations.