hidet-org / hidet

An open-source, efficient deep learning framework/compiler, written in Python.
https://hidet.org
Apache License 2.0

[RFC][Discussion] Automatic Parallelization #335

Open soodoshll opened 11 months ago

soodoshll commented 11 months ago

rendered rfc

yaoyaoding commented 11 months ago

Hi @soodoshll, thanks for the draft!

It looks good as a first version of the RFC draft!

I have several suggestions:

  1. The 0001 and 0002 RFC slots have been used; consider using 0003.
  2. It might be better to give a concrete example of distributed_config and optimization_config in the guide-level explanation.
  3. The reference-level explanation is a good place to show what configs are available for distributed_config and optimization_config, and what each config specifies. It's okay to only include the ones that are known now, and to update the draft with more configs during implementation and refactoring in the future.
  4. I prefer putting out_dir as a separate function parameter instead of an attribute in the config.
  5. Add a reference to Alpa and use something like "Alpa[1]" in the main text.
  6. This part "For example, for a 4x4 multi-machine-multi-GPU cluster, the possible sharding specifications are (4x4, 1), (4(machine), 4(gpus)), (4(gpus), 4(machines)), (1, 4x4). We do not consider (2, 8) or (8, 2). Therefore, using R or Si is sufficient since the number of shards is determined by the number of devices." is a little vague. Consider adding an example to illustrate what a specific sharding specification means (e.g., (4 GPUs, 4 machines)), and explaining the meaning of "R" and "Si"; a small sketch follows this list.
  7. Consider using mesh_axes_per_dim in TensorShardSpec.
  8. The math formula in "Operator Sharding Specification" has some typesetting flaws.
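For items 6 and 7, something like the sketch below might make the notation concrete. It is purely illustrative (the fields and methods are not the final API); it only shows how per-dimension "R"/"Si" annotations could map onto the axes of a device mesh via mesh_axes_per_dim:

```python
# Illustrative sketch only, not the actual hidet API.
# Each tensor dimension is either replicated ("R") or sharded along one
# axis of the device mesh ("S0", "S1", ...).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TensorShardSpec:
    # mesh_axes_per_dim[i] is the device-mesh axis that shards tensor dim i,
    # or None if that dimension is replicated on every device.
    mesh_axes_per_dim: List[Optional[int]]

    def __str__(self) -> str:
        return ''.join('R' if a is None else f'S{a}' for a in self.mesh_axes_per_dim)

# On a (4 machines) x (4 GPUs) mesh: a 2-D weight whose first dimension is
# split across the machine axis (mesh axis 0) and whose second dimension is
# replicated would be written "S0R".
print(TensorShardSpec(mesh_axes_per_dim=[0, None]))  # -> S0R
```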
yaoyaoding commented 11 months ago

  1. Could add a section to describe the ILP formulation.
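For reference, one candidate is the intra-op formulation used by Alpa [1] (the exact cost terms in this RFC may differ): pick a one-hot strategy vector for every operator and minimize the per-op costs plus the pairwise resharding costs.

```latex
% Sketch of an Alpa-style ILP; notation is illustrative.
% s_v : one-hot strategy choice for operator v (k_v candidate strategies)
% c_v, d_v : communication / computation cost vectors of v's strategies
% R_{uv} : resharding cost matrix between adjacent operators u and v
\min_{s}\; \sum_{v} s_v^\top (c_v + d_v)
        \;+\; \sum_{(u,v)\in E} s_u^\top R_{uv}\, s_v
\quad\text{s.t.}\quad s_v \in \{0,1\}^{k_v},\;\; \mathbf{1}^\top s_v = 1 .
% The quadratic resharding term is linearized with auxiliary 0/1 variables
% so the whole problem can be handed to an off-the-shelf ILP solver.
```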

The design looks good to me. Hi @soodoshll and @xinli-git, could you also discuss how to separate the whole feature into relatively small steps to implement? We can use this issue to track the PRs related to this RFC, something like https://github.com/apache/tvm/issues/15319. Thanks!

soodoshll commented 11 months ago

Hi @yaoyaoding, thanks for your suggestions. I've fixed the draft.

The whole feature can be decomposed into the following steps:

  1. Design and implement the data structures for tensor and op sharding specifications
  2. The connect function, which relies on (1)
  3. Sharding rule generation, which relies on (1)
  4. Weight sharding and comm op injection, which relies on (2)
  5. The auto-parallelization algorithm, which relies on (2) and (3)
  6. Run end-to-end tests

I'm working on (1). After it is done, we can start (2) and (3). I have a prototype of (3), which I will integrate later.

Hi @xinli-git, let's work in the auto-parallel branch.

soodoshll commented 11 months ago

I found that resharding (converting a tensor between ops with different sharding specifications) sometimes requires the collective communication primitive all-to-all. For example, this happens when an MxN matrix is sharded along axis M and we want to convert it to be sharded along axis N.
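A tiny NumPy simulation (illustrative only, not hidet or NCCL code) of why this resharding is an all-to-all: every device must send one distinct block to every other device.

```python
import numpy as np

world_size = 4
M, N = 8, 8
full = np.arange(M * N).reshape(M, N)

# Row-sharded: device d owns rows [d*M/world_size, (d+1)*M/world_size).
row_shards = np.split(full, world_size, axis=0)

# To become column-sharded, device d needs the d-th column block of every
# row shard, i.e. device s must send a distinct block to every device d:
# exactly the all-to-all communication pattern.
col_shards = [
    np.concatenate(
        [np.split(row_shards[s], world_size, axis=1)[d] for s in range(world_size)],
        axis=0)
    for d in range(world_size)
]

# Each reconstructed shard matches the reference column-sharded layout.
for d in range(world_size):
    assert np.array_equal(col_shards[d], np.split(full, world_size, axis=1)[d])
```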

Though NCCL does not directly support all-to-all, it can be implemented with send and recv. Without all-to-all, a workaround is to use all-gather and then slice the result for the same purpose, at the cost of suboptimal performance.
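For completeness, a minimal sketch of the all-gather-plus-slicing workaround (again just NumPy, not hidet ops): every device materializes the full sharded axis and keeps only its column block, which is correct but moves roughly world_size times more data per device than all-to-all.

```python
import numpy as np

world_size, rank = 4, 1          # pretend we are device 1 of 4
M, N = 8, 8
full = np.arange(M * N).reshape(M, N)
row_shards = np.split(full, world_size, axis=0)   # what each device holds initially

# All-gather: every device ends up with the whole matrix along the sharded axis...
gathered = np.concatenate(row_shards, axis=0)
# ...and then slices out only the column block it is supposed to own.
my_col_shard = np.split(gathered, world_size, axis=1)[rank]

assert np.array_equal(my_col_shard, np.split(full, world_size, axis=1)[rank])
```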

I'd suggest treating it as a low-priority TODO item and seeing whether it actually causes a performance issue. We can fix it after finishing the backbone of the whole pipeline.

xinli-git commented 11 months ago

Thanks! @soodoshll. The RFC is very detailed.

For modelling computation, it seems that Alpa assumes that all tensor contraction ops (MM, Conv) must be fully sharded, so all such ops have the same computation cost under different sharding strategies. They also observe that the other ops have negligible computation cost at runtime (I verified this as well). As a result, they concluded there was no need to model computation.
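A quick arithmetic check of that observation (illustrative numbers): for C[M,N] = A[M,K] @ B[K,N] fully sharded over d devices along any single axis, the per-device FLOPs are identical; only the communication that follows differs.

```python
M, N, K, d = 4096, 4096, 4096, 4

total_flops = 2 * M * N * K              # multiply-accumulate count of the full matmul
flops_shard_M = 2 * (M // d) * N * K     # each device owns M/d rows of A and C
flops_shard_N = 2 * M * (N // d) * K     # each device owns N/d columns of B and C
flops_shard_K = 2 * M * N * (K // d)     # partial sums; needs a reduce afterwards

assert flops_shard_M == flops_shard_N == flops_shard_K == total_flops // d
```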

Since this feature probably requires a month of work from multiple people (currently me and Qidong), I was thinking maybe we can leverage GitHub Projects (https://github.com/hidet-org/hidet/projects?query=is%3Aopen).

@yaoyaoding, if you think that's a good idea, I will take the lead on this.

yaoyaoding commented 11 months ago

Hi @xinli-git, sounds good to me. I have not used the GitHub Projects feature before, but you can give it a try and let's see whether it helps with organization and planning.