During @vishalgupta97 internship we observed problems with the work stealing algorithm that if all the work was generated by a single scheduler thread, then the other threads had heavy contention in the stealing code. This is due to the stealing code only stealing a single item at a time.
There are a few papers that suggest stealing in larger units than single work items:
The suggestions is that stealing at most half of the work is sensible.
This PR changes the implementation of the scheduler queues to support a steal all operation. By providing N (N=4 in the PR) queues per scheduler thread, then we are able to steal approximately 1/4 of the work on a scheduler thread.
The PR adds a very simple benchmark where a single behaviour generates all the work in the system. Configured with 4 cores, we get
Before this PR:
Scheduled all work took:
317 ms
Elapsed:
606 ms
While with his PR:
Scheduled all work took:
234 ms
Elapsed:
251 ms
This is running on my laptop with just four cores, so bigger experiments should be undertaken.
There is also a lot of opportunity for further investigations, in terms of different strategies for stealing and scheduling work. This PR represents a simple position that improves the state of the system.
During @vishalgupta97 internship we observed problems with the work stealing algorithm that if all the work was generated by a single scheduler thread, then the other threads had heavy contention in the stealing code. This is due to the stealing code only stealing a single item at a time.
There are a few papers that suggest stealing in larger units than single work items:
The suggestions is that stealing at most half of the work is sensible.
This PR changes the implementation of the scheduler queues to support a steal all operation. By providing N (N=4 in the PR) queues per scheduler thread, then we are able to steal approximately 1/4 of the work on a scheduler thread.
The PR adds a very simple benchmark where a single behaviour generates all the work in the system. Configured with 4 cores, we get
Before this PR:
While with his PR:
This is running on my laptop with just four cores, so bigger experiments should be undertaken.
There is also a lot of opportunity for further investigations, in terms of different strategies for stealing and scheduling work. This PR represents a simple position that improves the state of the system.