Open Bukhtawar opened 1 month ago
Looking at the stack trace in the thread dump, the workers are waiting for the lock on the throttling key (update-snapshot-state), and the worker holding the lock is busy logging to a file. The log is emitted at WARN level whenever a task is throttled, just before failing the transport call.
One possible fix to avoid the lock contention is to move the following two code blocks out of compute-if-absent
That way, the critical section is limited to incrementing the request count and performs no other computation. This would keep transport workers from getting blocked when all of them happen to enqueue tasks that belong to the same throttling key.
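A minimal sketch of that idea, not the actual OpenSearch code: the class `ThrottleCounter`, its `limit`, and the `warnLog` callback (a stand-in for the real logger) are hypothetical names. It uses `ConcurrentHashMap.compute` rather than compute-if-absent for brevity. The per-key lock is held only while the counter is checked and updated; the WARN message is written after the lambda returns.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;

/** Hypothetical per-key pending-task counter; illustrative only. */
final class ThrottleCounter {
    private final ConcurrentMap<String, Long> pendingByKey = new ConcurrentHashMap<>();
    private final long limit;               // assumed: one limit shared by all keys
    private final Consumer<String> warnLog; // stand-in for a WARN-level logger

    ThrottleCounter(long limit, Consumer<String> warnLog) {
        this.limit = limit;
        this.warnLog = warnLog;
    }

    /** Returns true if the tasks were admitted, false if they were throttled. */
    boolean tryAcquire(String key, int numTasks) {
        // One-element array so the lambda can report its decision to the caller.
        boolean[] throttled = {false};
        pendingByKey.compute(key, (k, count) -> {
            long current = (count == null) ? 0L : count;
            if (current + numTasks > limit) {
                throttled[0] = true;
                return count; // leave the counter unchanged
            }
            return current + numTasks;
        });
        // Logging happens here, after the map's per-key lock has been released.
        if (throttled[0]) {
            warnLog.accept("Throttling pending tasks for key [" + key + "], limit [" + limit + "]");
            return false;
        }
        return true;
    }

    /** Decrements the counter once tasks finish; removes the entry when it reaches zero. */
    void release(String key, int numTasks) {
        pendingByKey.computeIfPresent(key, (k, count) -> {
            long remaining = count - numTasks;
            return remaining <= 0 ? null : remaining;
        });
    }

    long pending(String key) {
        return pendingByKey.getOrDefault(key, 0L);
    }
}
```

With this shape, a slow log appender delays only the thread whose request was throttled, not every worker queued on the same key.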
@Bukhtawar @shwetathareja - Thoughts?
Do you think we should move the task submission and throttling logic off network threads to avoid getting into retry loops and stalling transport?
I see that task submission involves the following two operations
Let me know if I am missing something.
Both of these operations look lightweight in the current implementation and should not take significant time.
I think we can profile submitTask and evaluate whether some of the operations need to be moved to a background thread (working on a snapshot of the PendingTask queue). For instance, the throttling decider can be moved to a background thread, and submitTask can then only enforce the throttling decision.
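One way this split could look, as a sketch under assumptions: `BackgroundThrottleDecider`, `refresh`, and `shouldReject` are hypothetical names, and the snapshot is modelled as a plain map from throttling key to pending count. A background thread calls `refresh` periodically; the submit path only reads the published set of throttled keys.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical decider: decisions are computed off the submit path and only enforced on it. */
final class BackgroundThrottleDecider {
    private final Map<String, Integer> limits;             // assumed per-key limits
    private volatile Set<String> throttledKeys = Set.of(); // published decision, replaced atomically

    BackgroundThrottleDecider(Map<String, Integer> limits) {
        this.limits = Map.copyOf(limits);
    }

    /** Intended to run on a background thread, given a snapshot of pending counts per key. */
    void refresh(Map<String, Integer> pendingSnapshot) {
        Set<String> next = new HashSet<>();
        for (Map.Entry<String, Integer> entry : pendingSnapshot.entrySet()) {
            Integer limit = limits.get(entry.getKey());
            // Keys without a configured limit are never throttled in this sketch.
            if (limit != null && entry.getValue() >= limit) {
                next.add(entry.getKey());
            }
        }
        throttledKeys = Set.copyOf(next); // single volatile write publishes the new decision
    }

    /** Cheap check for the submit path: one volatile read and a set lookup, no locking. */
    boolean shouldReject(String throttlingKey) {
        return throttledKeys.contains(throttlingKey);
    }
}
```

The trade-off is staleness: between two refreshes the submit path acts on an old decision, so a burst can briefly overshoot the limit or keep being rejected after the queue has drained.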
On retrying a request once it is throttled, I think this should happen from the caller, rather than by adding retry handlers in cluster-manager. We can have cluster-manager send additional signals about the PendingTaskQueue once the request is throttled, which the caller can use to decide on retries.
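As an illustration of a caller-side decision, here is a sketch that assumes the throttling response carries two hypothetical fields, the number of pending tasks and the limit for the throttling key. `ThrottledRetryPolicy` and its parameters are made-up names; the policy is plain capped exponential backoff, stretched when the reported queue is above its limit. Jitter is left out to keep the example deterministic.

```java
/** Hypothetical caller-side retry policy driven by queue signals from the throttling response. */
final class ThrottledRetryPolicy {
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private final int maxAttempts;

    ThrottledRetryPolicy(long baseDelayMillis, long maxDelayMillis, int maxAttempts) {
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.maxAttempts = maxAttempts;
    }

    /** attempt is zero-based: 0 is the first retry. */
    boolean shouldRetry(int attempt) {
        return attempt < maxAttempts;
    }

    /** Delay before the given retry, scaled by how far the reported queue is over its limit. */
    long delayMillis(int attempt, int pendingTasks, int limit) {
        // Exponential backoff; the shift is capped so the multiplication cannot overflow wildly.
        long backoff = baseDelayMillis << Math.min(attempt, 20);
        // Saturation is at least 1.0, so the signal can only lengthen the delay.
        double saturation = (limit <= 0) ? 1.0 : Math.max(1.0, (double) pendingTasks / limit);
        long scaled = (long) (backoff * saturation);
        return Math.min(maxDelayMillis, scaled);
    }
}
```

Keeping this logic in the caller means cluster-manager only has to report its state, and each client can choose its own retry budget.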
Describe the bug
Related component
Cluster Manager