The old code ran the map phase in manual batches, then called out to our own reduce library, which used `ray.wait`. This was a mess.
The new code is a general-purpose solution: you define `map`, `reduce`, and `zero` methods on a "context" class, and everything else is handled under the hood. In particular, rather than batching, we extend the original reduction strategy: `ray.wait` is used to track the number of pending tasks, and whenever it drops below the desired ceiling, more map tasks are launched.
Thus, no batching, and we always keep as many tasks in flight as possible, up to the desired ceiling.
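To make the shape of this concrete, here is a minimal sketch of the pattern. It is not the actual implementation: the class and method names beyond `map`/`reduce`/`zero` are hypothetical, and it uses stdlib `concurrent.futures` as a stand-in for Ray tasks, with `wait(..., return_when=FIRST_COMPLETED)` playing the role of `ray.wait` in the windowing loop.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

class SumOfSquaresContext:
    """Hypothetical example context: map squares each item, reduce sums."""

    def zero(self):
        return 0

    def map(self, item):
        return item * item

    def reduce(self, acc, value):
        return acc + value

def map_reduce(ctx, items, max_in_flight=4):
    """Run ctx.map over items, folding results with ctx.reduce,
    while never letting more than max_in_flight tasks be pending."""
    acc = ctx.zero()
    it = iter(items)
    pending = set()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while True:
            # Top up: launch map tasks until we hit the ceiling
            # or run out of input.
            while len(pending) < max_in_flight:
                try:
                    item = next(it)
                except StopIteration:
                    break
                pending.add(pool.submit(ctx.map, item))
            if not pending:
                break
            # Analogue of ray.wait: block until at least one task
            # finishes, then fold its result into the accumulator.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                acc = ctx.reduce(acc, fut.result())
    return acc

print(map_reduce(SumOfSquaresContext(), range(10)))  # → 285
```

The key property is the same as described above: there is no batch boundary, so a slow map task never stalls the launch of the others; the window simply refills as results drain.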
All the tests pass, and it runs fine on localhost. Still need to try this on a big cluster. Also need a second pair of eyes to make sure I didn't do something stupid.