Thanks for getting to the root of the problem!
I believe we should improve our observability tooling so we can keep track of the compaction queue. I left a note about this in another PR and I'll prioritize it.
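As a starting point, here is a minimal sketch of what that observability could look like (hypothetical metric and package names, not Pyroscope's actual code): a gauge broken down by compaction level and job status would already make queue growth and starvation visible.

```go
package scheduler

import "github.com/prometheus/client_golang/prometheus"

// compactionQueueJobs is a hypothetical gauge tracking how many jobs sit in the
// scheduler queue, broken down by compaction level and job status.
var compactionQueueJobs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "compaction_queue_jobs",
		Help: "Number of jobs in the compaction queue by level and status.",
	},
	[]string{"level", "status"},
)

func init() {
	prometheus.MustRegister(compactionQueueJobs)
}

// updateQueueMetrics would be called after every enqueue, dequeue, or re-assignment,
// with counts aggregated from the queue contents.
func updateQueueMetrics(counts map[string]map[string]int) {
	for level, byStatus := range counts {
		for status, n := range byStatus {
			compactionQueueJobs.WithLabelValues(level, status).Set(float64(n))
		}
	}
}
```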
Assuming the current time is t and the lease deadline expires at t-2 (two ticks ago):

UNASSIGNED  <- Jobs without owners, no issues here: we start with the lowest level.
  LEVEL_0
    t-1
    t-0
  LEVEL_1
    t-1
    t-0
IN_PROGRESS <- Jobs that have an owner assigned. There are cases when an owner abandons a job. We wait for the lease expiration before re-assigning the job.
  LEVEL_0
    t-3 <- Abandoned, eligible for re-assignment.
    t-2 <- Abandoned, eligible for re-assignment.
    t-1 <- Still owned. WE STOP HERE.
  LEVEL_1
    t-3 <- This job IS eligible for re-assignment but we skip it.
    t-2 <- This job IS eligible for re-assignment but we skip it.
    t-1 <- Still owned.
CANCELLED   <- Broken jobs that we should not touch.
  LEVEL_0
    t-N
    t-N-1
  LEVEL_1
    t-N
    t-N-1
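To make the "WE STOP HERE" behaviour concrete, here is a rough sketch (simplified types, not the actual Pyroscope code) of a single queue ordered by status, then level, then lease expiry, where assignment stops at the first job it cannot hand out:

```go
package scheduler

type status int

const (
	statusUnassigned status = iota
	statusInProgress
	statusCancelled
)

type job struct {
	name        string
	level       int
	status      status
	leaseExpiry int64 // 0 for jobs that were never assigned
}

// assignNext peeks at the head of the queue, which is already ordered by
// (status, level, leaseExpiry, name). A still-owned in-progress job at the head
// blocks everything behind it, including the expired LEVEL_1 jobs in the example.
func assignNext(orderedQueue []*job, now int64) *job {
	if len(orderedQueue) == 0 {
		return nil
	}
	head := orderedQueue[0]
	switch {
	case head.status == statusUnassigned:
		return head // new job: always assignable
	case head.status == statusInProgress && head.leaseExpiry <= now:
		return head // abandoned job: lease expired, eligible for re-assignment
	default:
		return nil // still-owned (or cancelled) head: nothing gets assigned
	}
}
```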
My assumption was that we need SJF scheduling (Shortest Job First): we do want to finish L0 jobs before handling L1 jobs, regardless of anything else. I still believe this is the most beneficial strategy for us.
Theoretically, if we have enough compaction-worker capacity, the L1 jobs with expired leases from the example above will eventually get compacted. In practice, however, that is not the case: abandoned jobs starve.
I agree that we should not keep jobs eligible for re-assignment in the queue longer than necessary.
Fix: I've changed the order to: lease expiry -> compaction level -> name
I see a couple of issues here with how dequeue works. I would approach it in a different way:
We could have a queue of queues, but that would add overhead given that the set of queues is static and we always want to assign the smallest job first.
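For illustration only (hypothetical types, not the actual implementation): since the set of levels is static, one way to keep the per-level behaviour without a literal queue of queues is a fixed array of per-level queues that assignment scans from L0 upwards.

```go
package scheduler

import "sort"

type job struct {
	name        string
	level       int   // assumed to be < maxLevels
	leaseExpiry int64 // 0 for jobs that have never been assigned
}

const maxLevels = 8 // hypothetical bound on the number of compaction levels

// levelQueues keeps one slice per compaction level, each ordered by lease expiry.
type levelQueues struct {
	levels [maxLevels][]*job
}

// push inserts a job into its level's queue, keeping that queue sorted by lease expiry.
func (q *levelQueues) push(j *job) {
	l := q.levels[j.level]
	i := sort.Search(len(l), func(i int) bool { return l[i].leaseExpiry >= j.leaseExpiry })
	l = append(l, nil)
	copy(l[i+1:], l[i:])
	l[i] = j
	q.levels[j.level] = l
}

// assignNext prefers lower levels (smallest jobs first). Within a level, new jobs
// (zero expiry) and long-abandoned jobs come before recently leased ones; a level whose
// head lease is still valid holds no assignable jobs at all, so it is simply skipped
// and cannot block higher levels. Re-inserting the assigned job with its new lease
// deadline is left out of the sketch.
func (q *levelQueues) assignNext(now int64) *job {
	for l := range q.levels {
		if len(q.levels[l]) == 0 {
			continue
		}
		if head := q.levels[l][0]; head.leaseExpiry <= now {
			q.levels[l] = q.levels[l][1:]
			return head
		}
	}
	return nil
}
```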
This is what I'm going to implement in the PR I'm working on. If this fix can wait, and you have no objections, I'd prefer to avoid merge conflicts.
I agree – I created a separate issue.
However, in this particular case (and in many others), we do not need to resort to complex migrations. We only need to create a "point of linearizability". This problem is not unique to Raft or consensus-based systems; see a recent example.
Rule of thumb: parameters should be part of the raft log and the FSM state.
How I would approach this: make the ordering a parameter that lives in the Raft log and the FSM state, so every replica applies the same ordering and the switch happens at a single point in the log. This strategy should generally work well for parameters (and anything we can parameterize), as well as for changes that do not directly impact the storage schema.
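A heavily simplified sketch of that rule of thumb (hypothetical types, not Pyroscope's actual metastore code): the ordering becomes a parameter carried in a Raft command and stored in the FSM, so every replica switches to it at the same log position.

```go
package metastore

import "encoding/json"

// SchedulerParams is an example of a parameter we might want to linearize:
// the job ordering used by the compaction scheduler.
type SchedulerParams struct {
	JobOrder string `json:"job_order"` // e.g. "status,level,lease" or "lease,level,name"
}

// UpdateSchedulerParamsCommand is a hypothetical Raft command carrying the parameters.
type UpdateSchedulerParamsCommand struct {
	Params SchedulerParams `json:"params"`
}

// fsm is a stand-in for the replicated state machine; the params are part of its state
// and would therefore be included in snapshots and restores.
type fsm struct {
	params SchedulerParams
}

// applyUpdateSchedulerParams is what the FSM would run when the command is committed.
// Because it is applied at a fixed position in the log, all replicas switch to the new
// ordering at the same point: a "point of linearizability".
func (f *fsm) applyUpdateSchedulerParams(raw []byte) error {
	var cmd UpdateSchedulerParamsCommand
	if err := json.Unmarshal(raw, &cmd); err != nil {
		return err
	}
	f.params = cmd.Params
	return nil
}
```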
I agree that we should not keep jobs eligible for re-assignment in the queue longer than necessary. However, when it's time for big jobs, they will be scheduled before smaller jobs (almost all at once), which will delay L0 compactions and impact the read path. This contradicts the SJF approach.
To clarify, the issue is more severe than simply keeping jobs in the queue longer than necessary. In our dev environment the current implementation led to an ever-growing queue (over 500 jobs) because we always have L0 jobs in progress. This actually made the read path slower, and applying the change brought a noticeable improvement.
Although it seems that the change I am proposing goes against SJF, in reality new L0 jobs are always prioritized first; the change only bumps expired jobs ahead of other expired jobs. A queue per level would definitely be ideal, but this change actually gets us 95% of the way there (after fixing the bug with cancelled jobs).
This is what I'm going to implement in https://github.com/grafana/pyroscope/pull/3652, which I'm working on. If this fix can wait, and you have no objections, I'd prefer to avoid merge conflicts.
I think for the sake of clarity #3652 should really focus on refactoring without functionality changes. The reason I addressed this now is that it severely impacted read performance, which I am working on addressing in parallel. I see, however, that you're already working on this, so I'll close this.
While investigating performance issues on the read path I noticed non-compacted blocks being used for time ranges way in the past.
Looking at logs and a snapshot of these jobs, it looks like we have in-progress jobs that become undiscoverable. This happens when a compaction worker shuts down or is unable to reach the metastore to send its completion update. The in-progress jobs stay in the queue and due to the current ordering they never get assigned.
Current ordering: status -> compaction level -> lease expiry -> name
Example:
  j := q.dequeue(now, index) -> L0-S12-15852522290054690452 ✅
  j := q.dequeue(now, index) -> nil ❎ (the L0 job is still first but can't be assigned since it hasn't expired)

The first job can't get assigned while there are L0 jobs in the queue. In-progress L0 jobs can also get stuck in the queue if we have a steady supply of new L0 jobs.
Fix: I've changed the order to: lease expiry -> compaction level -> name
New jobs have a 0 lease expiry so they will still be prioritized over in-progress jobs. Once all new jobs get assigned, we start assigning expired jobs in the order they were originally assigned.
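For illustration, a minimal sketch of the proposed comparator (simplified job type, not the actual queue implementation):

```go
package scheduler

// less encodes the new ordering: lease expiry -> compaction level -> name.
// New jobs carry a zero lease expiry, so they still sort ahead of every in-progress
// job; among expired jobs, the one whose lease ran out longest ago comes first.

type job struct {
	name        string
	level       int
	leaseExpiry int64 // 0 until the job is assigned for the first time
}

func less(a, b *job) bool {
	if a.leaseExpiry != b.leaseExpiry {
		return a.leaseExpiry < b.leaseExpiry
	}
	if a.level != b.level {
		return a.level < b.level
	}
	return a.name < b.name
}
```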
Alternatively we could replace the priority queue with a sorted slice and keep iterating over it until we find a job that can be assigned. This will be more expensive but will allow us to inspect the job order which is currently not easy.
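A sketch of that alternative (again with a simplified job type, not the actual code): keep the jobs in one slice sorted by the same ordering and scan past entries that are not assignable yet instead of stopping at the head. Each dequeue is O(n) in the worst case, but the full order is trivial to inspect or dump for debugging.

```go
package scheduler

type queuedJob struct {
	name        string
	level       int
	leaseExpiry int64 // 0 until the job is assigned for the first time
}

// dequeue returns the first assignable job in the sorted slice; the caller would
// remove it from the slice and re-insert it with its new lease deadline.
func dequeue(sorted []*queuedJob, now int64) *queuedJob {
	for _, j := range sorted {
		if j.leaseExpiry <= now { // new (zero expiry) or lease already expired
			return j
		}
		// Lease still valid: skip it and keep scanning further down the order.
	}
	return nil
}
```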
Important: Deploying a change like this will introduce state inconsistencies during rollouts because pods will assign jobs differently. This is not a unique problem for this change and we probably need to address it in a more general way.