Thanks for getting to the root of the problem!
I believe we should improve our observability tooling so we can keep track of the compaction queue. I left a note about this in another PR and I'll prioritize it.
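As a starting point, here is a minimal sketch of what that observability could look like (hypothetical metric and package names, not Pyroscope's actual code): a gauge broken down by compaction level and job status would already make queue growth and starvation visible.

```go
package scheduler

import "github.com/prometheus/client_golang/prometheus"

// compactionQueueJobs is a hypothetical gauge tracking how many jobs sit in the
// scheduler queue, broken down by compaction level and job status.
var compactionQueueJobs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "compaction_queue_jobs",
		Help: "Number of jobs in the compaction queue by level and status.",
	},
	[]string{"level", "status"},
)

func init() {
	prometheus.MustRegister(compactionQueueJobs)
}

// updateQueueMetrics would be called after every enqueue, dequeue, or re-assignment,
// with counts aggregated from the queue contents.
func updateQueueMetrics(counts map[string]map[string]int) {
	for level, byStatus := range counts {
		for status, n := range byStatus {
			compactionQueueJobs.WithLabelValues(level, status).Set(float64(n))
		}
	}
}
```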
Assuming the current time is t and the lease deadline expires at t-2 (two ticks ago):

UNASSIGNED  <- Jobs without owners, no issues here: we start with the lowest level.
  LEVEL_0
    t-1
    t-0
  LEVEL_1
    t-1
    t-0
IN_PROGRESS <- Jobs that have an owner assigned. There are cases when an owner abandons a job. We wait for the lease expiration before re-assigning the job.
  LEVEL_0
    t-3 <- Abandoned, eligible for re-assignment.
    t-2 <- Abandoned, eligible for re-assignment.
    t-1 <- Still owned. WE STOP HERE.
  LEVEL_1
    t-3 <- This job IS eligible for re-assignment but we skip it.
    t-2 <- This job IS eligible for re-assignment but we skip it.
    t-1 <- Still owned.
CANCELLED   <- Broken jobs that we should not touch.
  LEVEL_0
    t-N
    t-N-1
  LEVEL_1
    t-N
    t-N-1
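To make the "WE STOP HERE" behaviour concrete, here is a rough sketch (simplified types, not the actual Pyroscope code) of a single queue ordered by status, then level, then lease expiry, where assignment stops at the first job it cannot hand out:

```go
package scheduler

type status int

const (
	statusUnassigned status = iota
	statusInProgress
	statusCancelled
)

type job struct {
	name        string
	level       int
	status      status
	leaseExpiry int64 // 0 for jobs that were never assigned
}

// assignNext peeks at the head of the queue, which is already ordered by
// (status, level, leaseExpiry, name). A still-owned in-progress job at the head
// blocks everything behind it, including the expired LEVEL_1 jobs in the example.
func assignNext(orderedQueue []*job, now int64) *job {
	if len(orderedQueue) == 0 {
		return nil
	}
	head := orderedQueue[0]
	switch {
	case head.status == statusUnassigned:
		return head // new job: always assignable
	case head.status == statusInProgress && head.leaseExpiry <= now:
		return head // abandoned job: lease expired, eligible for re-assignment
	default:
		return nil // still-owned (or cancelled) head: nothing gets assigned
	}
}
```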
My assumption was that we need SJF scheduling (Shortest Job First): we do want to finish L0 jobs before handling L1 jobs, regardless of anything else. I still believe this is the most beneficial strategy for us.
Theoretically, if we have enough compaction-worker capacity, the L1 jobs with expired leases from the example above will eventually get compacted. In practice, however, that is not the case: abandoned jobs starve.
I agree that we should not keep jobs eligible for re-assignment in the queue longer than necessary.
Fix: I've changed the order to: lease expiry -> compaction level -> name
I see a couple of issues here with how dequeue works. I would approach it in a different way:
We could have a queue of queues, but that would add overhead given that the set of queues is static and we always want to assign the smallest job first.
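For illustration only (hypothetical types, not the actual implementation): since the set of levels is static, one way to keep the per-level behaviour without a literal queue of queues is a fixed array of per-level queues that assignment scans from L0 upwards.

```go
package scheduler

import "sort"

type job struct {
	name        string
	level       int   // assumed to be < maxLevels
	leaseExpiry int64 // 0 for jobs that have never been assigned
}

const maxLevels = 8 // hypothetical bound on the number of compaction levels

// levelQueues keeps one slice per compaction level, each ordered by lease expiry.
type levelQueues struct {
	levels [maxLevels][]*job
}

// push inserts a job into its level's queue, keeping that queue sorted by lease expiry.
func (q *levelQueues) push(j *job) {
	l := q.levels[j.level]
	i := sort.Search(len(l), func(i int) bool { return l[i].leaseExpiry >= j.leaseExpiry })
	l = append(l, nil)
	copy(l[i+1:], l[i:])
	l[i] = j
	q.levels[j.level] = l
}

// assignNext prefers lower levels (smallest jobs first). Within a level, new jobs
// (zero expiry) and long-abandoned jobs come before recently leased ones; a level whose
// head lease is still valid holds no assignable jobs at all, so it is simply skipped
// and cannot block higher levels. Re-inserting the assigned job with its new lease
// deadline is left out of the sketch.
func (q *levelQueues) assignNext(now int64) *job {
	for l := range q.levels {
		if len(q.levels[l]) == 0 {
			continue
		}
		if head := q.levels[l][0]; head.leaseExpiry <= now {
			q.levels[l] = q.levels[l][1:]
			return head
		}
	}
	return nil
}
```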
This is what I'm going to implement in the PR I'm working on. If this fix can wait, and you have no objections, I'd prefer to avoid merge conflicts.
I agree – I created a separate issue.
However, in this particular case (and in many others), we do not need to resort to complex migrations. We only need to create a "point of linearizability". This problem is not unique to Raft or consensus-based systems; see a recent example.
Rule of thumb: parameters should be part of the raft log and the FSM state.
How I would approach this: make the ordering a parameter that lives in the Raft log and the FSM state, so every replica applies the same ordering and the switch happens at a single point in the log. This strategy should generally work well for parameters (and anything we can parameterize), as well as for changes that do not directly impact the storage schema.
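A heavily simplified sketch of that rule of thumb (hypothetical types, not Pyroscope's actual metastore code): the ordering becomes a parameter carried in a Raft command and stored in the FSM, so every replica switches to it at the same log position.

```go
package metastore

import "encoding/json"

// SchedulerParams is an example of a parameter we might want to linearize:
// the job ordering used by the compaction scheduler.
type SchedulerParams struct {
	JobOrder string `json:"job_order"` // e.g. "status,level,lease" or "lease,level,name"
}

// UpdateSchedulerParamsCommand is a hypothetical Raft command carrying the parameters.
type UpdateSchedulerParamsCommand struct {
	Params SchedulerParams `json:"params"`
}

// fsm is a stand-in for the replicated state machine; the params are part of its state
// and would therefore be included in snapshots and restores.
type fsm struct {
	params SchedulerParams
}

// applyUpdateSchedulerParams is what the FSM would run when the command is committed.
// Because it is applied at a fixed position in the log, all replicas switch to the new
// ordering at the same point: a "point of linearizability".
func (f *fsm) applyUpdateSchedulerParams(raw []byte) error {
	var cmd UpdateSchedulerParamsCommand
	if err := json.Unmarshal(raw, &cmd); err != nil {
		return err
	}
	f.params = cmd.Params
	return nil
}
```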
I agree that we should not keep jobs eligible for re-assignment in the queue longer than necessary. However, when it's time for big jobs, they will be scheduled before smaller jobs (almost all at once), which will delay L0 compactions and impact the read path. This contradicts the SJF approach.
To clarify, the issue is more severe than simply keeping jobs in the queue longer than necessary. In our dev environment the current implementation led to an ever-growing queue (over 500 jobs) because we always have L0 jobs in progress. This actually made the read path slower, and applying the change brought a noticeable improvement.
Although it seems that the change I am proposing goes against SJF, in reality new L0 jobs are always prioritized first; the change only bumps expired jobs ahead of other expired jobs. A queue per level would definitely be ideal, but this change actually gets us 95% of the way there (after fixing the bug with cancelled jobs).
This is what I'm going to implement in https://github.com/grafana/pyroscope/pull/3652, which I'm working on. If this fix can wait, and you have no objections, I'd prefer to avoid merge conflicts.
I think for the sake of clarity #3652 should really focus on refactoring without functionality changes. The reason I addressed this now is that it severely impacted read performance, which I am working on addressing in parallel. I see, however, that you're already working on this, so I'll close this.
While investigating performance issues on the read path I noticed non-compacted blocks being used for time ranges way in the past.
Looking at logs and a snapshot of these jobs, it looks like we have in-progress jobs that become undiscoverable. This happens when a compaction worker shuts down or is unable to reach the metastore to send its completion update. The in-progress jobs stay in the queue and due to the current ordering they never get assigned.
Current ordering: status -> compaction level -> lease expiry -> name
Example:
  j := q.dequeue(now, index) -> L0-S12-15852522290054690452 ✅
  j := q.dequeue(now, index) -> nil ❎ (the L0 job is still first but can't be assigned since it hasn't expired)

The first job can't get assigned while there are L0 jobs in the queue. In-progress L0 jobs can also get stuck in the queue if we have a steady supply of new L0 jobs.
Fix: I've changed the order to: lease expiry -> compaction level -> name
New jobs have a 0 lease expiry so they will still be prioritized over in-progress jobs. Once all new jobs get assigned, we start assigning expired jobs in the order they were originally assigned.
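For illustration, a minimal sketch of the proposed comparator (simplified job type, not the actual queue implementation):

```go
package scheduler

// less encodes the new ordering: lease expiry -> compaction level -> name.
// New jobs carry a zero lease expiry, so they still sort ahead of every in-progress
// job; among expired jobs, the one whose lease ran out longest ago comes first.

type job struct {
	name        string
	level       int
	leaseExpiry int64 // 0 until the job is assigned for the first time
}

func less(a, b *job) bool {
	if a.leaseExpiry != b.leaseExpiry {
		return a.leaseExpiry < b.leaseExpiry
	}
	if a.level != b.level {
		return a.level < b.level
	}
	return a.name < b.name
}
```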
Alternatively we could replace the priority queue with a sorted slice and keep iterating over it until we find a job that can be assigned. This will be more expensive but will allow us to inspect the job order which is currently not easy.
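A sketch of that alternative (again with a simplified job type, not the actual code): keep the jobs in one slice sorted by the same ordering and scan past entries that are not assignable yet instead of stopping at the head. Each dequeue is O(n) in the worst case, but the full order is trivial to inspect or dump for debugging.

```go
package scheduler

type queuedJob struct {
	name        string
	level       int
	leaseExpiry int64 // 0 until the job is assigned for the first time
}

// dequeue returns the first assignable job in the sorted slice; the caller would
// remove it from the slice and re-insert it with its new lease deadline.
func dequeue(sorted []*queuedJob, now int64) *queuedJob {
	for _, j := range sorted {
		if j.leaseExpiry <= now { // new (zero expiry) or lease already expired
			return j
		}
		// Lease still valid: skip it and keep scanning further down the order.
	}
	return nil
}
```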
Important: Deploying a change like this will introduce state inconsistencies during rollouts because pods will assign jobs differently. This is not a unique problem for this change and we probably need to address it in a more general way.