microsoft / rushstack

Monorepo for tools developed by the Rush Stack community
https://rushstack.io/

[rush] unassigned operations can ignore weighting constraints #4821

Open aramissennyeydd opened 1 week ago

aramissennyeydd commented 1 week ago

Summary

We're dealing with build cache inconsistencies across cobuild agents: agents hold the same lock but are unable to restore completed state because they don't compute the same build cache ID. This bug allows 2 expensive operations to run side by side on one agent, causing memory issues and timeouts related to memory pressure.

This PR uses the remote executing operation when reasonable and reduces the unassigned status to denoting just a sleep. It also improves the remote executing evaluation: since we no longer greedily get an event during execution time, that complexity has to move into the scheduler. I don't love how much I've added; this could be reduced to a few lines by adding a new property to OperationExecutionRecord, but that didn't feel like the right path either.

Details

From what I can tell, the unassigned operation needs a weight matching the operation it may pick up, because https://github.com/microsoft/rushstack/blob/c1effc398416068b7f276bac32044f775643295d/libraries/rush-lib/src/logic/operations/OperationExecutionManager.ts#L265 can start an operation that it reasons was not completed by the agent that initially locked it in the CacheableOperationPlugin, https://github.com/microsoft/rushstack/blob/c1effc398416068b7f276bac32044f775643295d/libraries/rush-lib/src/logic/operations/CacheableOperationPlugin.ts#L388-L398. If that weight is not reflected onto the unassigned operation, multiple operations can be picked up while skipping the weight check. For example, say you have 3 operations, parallelism set to 8, and 2 machines:

  1. Operation A with weight 8
  2. Operation B with weight 8
  3. Operation C with weight 4

Start: Machine 1 picks up Operation A; Machine 2 picks up Operation B
Step 1: Machine 1 finishes Operation A but fails to mark it complete; Machine 2 finishes Operation B
Step 2: Machine 2 picks up both Operation A and Operation C

The "finishes but fails to mark complete" step is possible with a build cache ID inconsistency (what we're seeing), or when a machine gets lost during execution and never reports its success state, so another machine picks up the operation.
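To make the over-commit concrete, here is a minimal TypeScript sketch of a weight-based admission check; the names and the check itself are hypothetical simplifications, not the actual Rush scheduler code:

```typescript
// Hypothetical sketch: if an unassigned placeholder carries a default
// weight of 1 instead of the real weight of the operation it will pick
// up, the concurrency check admits too much work onto one machine.
interface Op {
  name: string;
  weight: number;
}

const PARALLELISM = 8;
let usedWeight = 0;

// Admit an operation only if its weight fits the remaining capacity
// (a machine with nothing running always admits one operation).
function tryStart(op: Op): boolean {
  if (usedWeight + op.weight > PARALLELISM && usedWeight > 0) {
    return false; // at capacity
  }
  usedWeight += op.weight;
  return true;
}

// Machine 2 after finishing Operation B: Operation A arrives as an
// "unassigned" placeholder with weight 1 instead of its real weight 8.
const unassignedA: Op = { name: 'A (unassigned)', weight: 1 };
const opC: Op = { name: 'C', weight: 4 };

console.log(tryStart(unassignedA)); // true — 1 fits, but A really costs 8
console.log(tryStart(opC)); // true — 1 + 4 fits, so A and C run together
```

With the real weight of 8 on the placeholder, the second `tryStart` would have been rejected, which is the behavior the PR restores by determining the weight from the specific operation being picked up.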

Removing the ability to assign a remote executing operation to an unassigned operation allows the weight to be determined correctly from the specific operation being picked up.

How it was tested

I tested this with the sharded repo and verified that the cobuilds don't deadlock and that sharding still works effectively.

Interesting side note: this change may open up the ability to use more dynamic wait times, which could improve overall run times for agents that spend a lot of time in the sleep loop. Adjusting the sleep from 5s to 1s dropped run times from 40s to 28s.
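The dynamic-wait idea could be sketched as an adaptive polling interval; this is a TypeScript illustration under assumed constants (the 1s/5s bounds come from the numbers above, everything else is hypothetical):

```typescript
// Hypothetical adaptive sleep for the cobuild polling loop: poll fast
// while remote operations are completing, back off toward the original
// 5s interval when nothing changes. Not the actual Rush implementation.
const MIN_SLEEP_MS = 1000;
const MAX_SLEEP_MS = 5000;

function nextSleep(current: number, sawProgress: boolean): number {
  if (sawProgress) {
    return MIN_SLEEP_MS; // reset to fast polling while work completes
  }
  return Math.min(current * 2, MAX_SLEEP_MS); // exponential backoff, capped
}

let sleep = MIN_SLEEP_MS;
sleep = nextSleep(sleep, false); // 2000
sleep = nextSleep(sleep, false); // 4000
sleep = nextSleep(sleep, false); // 5000 (capped)
sleep = nextSleep(sleep, true); // 1000 (reset on progress)
console.log(sleep);
```

An agent stuck in the sleep loop would then converge on the long interval only when the remote side is genuinely idle, instead of always paying the fixed 5s.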

Impacted documentation

None.

chengcyber commented 3 days ago

If I understand it correctly, the current challenge is to consistently reproduce the same schedule across the different cobuild processes. In the current lightweight implementation, there is no centralized scheduler to control which runner will be assigned to run a task.

To tackle this, two initial ideas come to mind:

  1. Introduce a centralized scheduler. A centralized scheduler brings a much more complicated but powerful architecture for distributed execution. The drawback is that it goes somewhat against the intentionally lightweight approach of cobuild.

  2. Each runner can record how the tasks get scheduled, and cobuilds can reproduce the scheduler based on a log-based approach.

The workflow for Option 2 could be:

a: Turn on generating schedule logs for every runner by specifying a CLI parameter or environment variable (this could be the default behaviour). Running RUSH_COBUILD_SCHEDULE_LOG=1 rush cobuild --to <package_name> writes a log file somewhere, let's say common/temp/cobuild.schedule.log

b: Reproduce the schedule from the log by specifying a CLI parameter or environment variable. When running RUSH_COBUILD_REPRODUCE_BY_SCHEDULE_LOG=1 rush cobuild --to <package_name>, instead of scheduling via AsyncOperationQueue, Rush invokes a similar queue instance (maybe ReproducibleAsyncOperationQueue) that accurately assigns tasks in the predefined order that common/temp/cobuild.schedule.log dictates.
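The record/replay workflow above could look roughly like this TypeScript sketch; the class name, log format, and API are all hypothetical (ReproducibleAsyncOperationQueue does not exist yet):

```typescript
// Hypothetical record/replay queue for Option 2: a first run records the
// order in which operations were dequeued; a replay run hands operations
// out in exactly that order, so every agent reproduces the same schedule.
class ReproducibleQueue {
  private readonly log: string[] = [];

  // Pass a previously recorded order to replay it; omit it to record.
  constructor(private readonly replayOrder?: string[]) {}

  // Pick the next runnable operation from the set of ready operations.
  dequeue(ready: Set<string>): string | undefined {
    if (this.replayOrder) {
      const next = this.replayOrder[this.log.length];
      if (next !== undefined && ready.has(next)) {
        this.log.push(next);
        return next;
      }
      return undefined; // wait until the logged operation becomes ready
    }
    const [first] = ready; // recording mode: take any ready operation
    if (first !== undefined) {
      this.log.push(first);
    }
    return first;
  }

  // Serialized order, e.g. written to common/temp/cobuild.schedule.log.
  get scheduleLog(): string[] {
    return [...this.log];
  }
}

// Recording run:
const recorder = new ReproducibleQueue();
recorder.dequeue(new Set(['b', 'a']));
recorder.dequeue(new Set(['a']));

// Replay run on another agent reproduces the recorded order:
const replayer = new ReproducibleQueue(recorder.scheduleLog);
console.log(replayer.dequeue(new Set(['b', 'a']))); // same first op as recorded
```

The key property is that the replay queue blocks rather than improvising when the logged operation isn't ready yet, which is what makes the schedule deterministic across agents.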


Back to this PR: the weight property is actually a step in the same direction as Option 2, but it has some drawbacks I can see:

  1. Manually managing the weight setup is loose and error-prone. A better approach is to generate the "weight" (in the generic sense) from an actual run, such as the schedule.log I proposed here.
  2. It's not easy to use. For a cobuild that runs many tasks, setting up weights becomes tedious.

aramissennyeydd commented 3 days ago

@chengcyber I posted about this on Zulip. I think my path forward is to get rid of the unassigned status and move the sleep to CacheableOperationPlugin; that way we no longer need to interact with the scheduling loop at all or assign weights.

I don't think we need a central scheduler at this point; we're just experiencing some growing pains with cobuilds and trying to lessen the pain. I'm working on a plugin at the moment to detect cache entry drift, which is the other issue we're seeing.