msr-fiddle / pipedream

Planner for PipeDream-2BW #57

Open nict-wisdom opened 4 years ago

nict-wisdom commented 4 years ago

In the pipedream_2bw branch, we found the runtime that implements PipeDream-2BW. However, no explanation is given for the planner.

Can we use the planner?

deepakn94 commented 4 years ago

You can use this function for planning: https://github.com/msr-fiddle/pipedream/blob/pipedream_2bw/planner/planner.py#L33.

For the performance and memory cost functions, you might want to use direct measurements (from running 100 or so iterations for the respective configuration).
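
For illustration, a minimal sketch of such a direct measurement (not code from this repo; `block`, `sample_input`, and the iteration counts are hypothetical placeholders for one transformer block and a representative input batch):

```python
# Sketch: estimate computation_time_per_block by timing ~100 forward/backward
# passes of one block on the GPU. Placeholders only; not part of this repo.
import torch

def measure_block_time(block, sample_input, num_iters=100, warmup=10):
    block = block.cuda()
    sample_input = sample_input.cuda()

    # Warm-up iterations so lazy initialization and autotuning don't skew the timing.
    for _ in range(warmup):
        block(sample_input).sum().backward()
        block.zero_grad(set_to_none=True)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iters):
        block(sample_input).sum().backward()
        block.zero_grad(set_to_none=True)
    end.record()
    torch.cuda.synchronize()

    # elapsed_time() returns milliseconds; convert to seconds per iteration.
    return start.elapsed_time(end) / 1000.0 / num_iters
```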

nict-wisdom commented 4 years ago

Thank you for your very quick answer!

I was wondering how I can get values for some of the arguments: computation_time_per_block, num_parameters_per_block, num_activations_per_block, and output_activation_size.

More specifically, what exactly do num_activations_per_block and output_activation_size refer to, and how can I obtain them? Also, does the planner assume that the model consists of repeated identical (transformer) blocks?

I would appreciate it if you could answer these questions.

deepakn94 commented 4 years ago

num_activations_per_block is the size of the intermediate activations needed in a transformer block during training. output_activation_size is the size of the intermediate activations sent between workers. Note that you can get these by profiling your model.

And yes, we're assuming that these are transformer models where the transformer blocks are repeated some number of times.
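
As an illustration of such profiling, here is a minimal sketch (not code from this repo; `block` and `sample_input` are hypothetical placeholders) that sums the bytes of every module output inside one block as a rough estimate of num_activations_per_block, and reports the bytes of the block's final output as output_activation_size:

```python
# Sketch: estimate activation sizes for one block using PyTorch forward hooks.
# Placeholders only; not part of this repo.
import torch

def profile_activation_sizes(block, sample_input):
    activation_bytes = 0
    handles = []

    def hook(module, inputs, output):
        nonlocal activation_bytes
        # Only count plain tensor outputs; tuple outputs are ignored in this sketch.
        if torch.is_tensor(output):
            activation_bytes += output.numel() * output.element_size()

    # Register a hook on every submodule (and the block itself).
    for module in block.modules():
        handles.append(module.register_forward_hook(hook))

    with torch.no_grad():
        output = block(sample_input)

    for handle in handles:
        handle.remove()

    output_bytes = output.numel() * output.element_size()
    return activation_bytes, output_bytes
```

num_parameters_per_block can be read off directly with `sum(p.numel() for p in block.parameters())`.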

nict-wisdom commented 4 years ago

Thank you again for your kind support! I understand that PipeDream-2BW assumes uniform layers.

I have a related question about PipeDream (the earlier version). From my understanding of the paper, PipeDream can allocate different numbers of GPUs to different stages (unlike PipeDream-2BW). My question is whether the implementation supports such allocations.

When I try it, the optimizer (optimizer_graph_hierarchical.py) does produce such allocations. However, the runtime often blocks under these allocations. (One reason is the gradient synchronization among processes in the same stage, but there must be other reasons.) Moreover, I found the following comment:

> TODO: don't current support uneven configurations.

Does "uneven configurations" here mean allocating different numbers of GPUs to different stages?

When I use a certain number of GPUs (8/16/32) to train ResNet, most of the generated configurations block soon after training starts. Could you tell me how to solve this, or whether it is possible to generate only safe configurations?

barrydoooit commented 1 year ago

> You can use this function for planning: https://github.com/msr-fiddle/pipedream/blob/pipedream_2bw/planner/planner.py#L33.
>
> For the performance and memory cost functions, you might want to use direct measurements (from running 100 or so iterations for the respective configuration).

May I know if this is still available for non-commercial usage now?