pytorch / glow

Compiler for Neural Network hardware accelerators
Apache License 2.0

[Optimization] Splitting FC weights to facilitate distribution of compute across multiple computational units #3462

Open · rdzhabarov opened this issue 5 years ago

rdzhabarov commented 5 years ago

Problem:

Consider an FC operator with an input of shape [M; K], weights of shape [K; N], and a bias of shape [N]. For certain ASIC architectures it is beneficial to split the weights into several chunks and distribute them over multiple processing units. This way we can fit the weights into SRAM, take advantage of parallel execution, and reduce overall latency.
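For intuition, the split is a lossless rewrite: each weight chunk produces a disjoint slice of the output columns, so concatenating the per-chunk results reproduces the original FC exactly. A minimal numpy sketch (toy shapes, not Glow code) illustrating the equivalence:

```python
# Toy numpy demonstration (not Glow code): an FC with weights [K; N] equals
# the column-wise concat of FCs computed on weight chunks [K; N/2] with
# correspondingly split biases.
import numpy as np

M, K, N = 4, 8, 6
x = np.random.randn(M, K)          # input   [M; K]
w = np.random.randn(K, N)          # weights [K; N]
b = np.random.randn(N)             # bias    [N]

full = x @ w + b                   # original FC output [M; N]

w1, w2 = np.split(w, 2, axis=1)    # two [K; N/2] weight chunks
b1, b2 = np.split(b, 2)            # two [N/2] bias chunks
split = np.concatenate([x @ w1 + b1, x @ w2 + b2], axis=1)

assert np.allclose(full, split)    # identical result; each chunk FC is smaller
```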

Proposed changes:

Let's consider the scenario where the weights are split into two chunks of shape [K; N/2] each and distributed over two computational units at compile time (with the help of the partitioner), with the bias correspondingly split into two [N/2] sub-tensors. In this case:

From the Glow perspective, this could be expressed as a general graph-level optimization (followed by the partitioner) which can be enabled/disabled by backends as needed. The optimization needs a parameter specifying the number of chunks to split the weights and bias into (mostly driven by the number of computational units), as well as the minimum size of N at which the optimization kicks in. Both parameters are architecture-dependent and might need tuning; for now we could make them constants and later figure out a way to set them on a per-backend basis.
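As a rough illustration of how those two knobs might interact, here is a hedged Python sketch; the names NUM_CHUNKS and MIN_N_TO_SPLIT are invented for this example and are not existing Glow options:

```python
# Hypothetical sketch of the two tuning parameters described above.
NUM_CHUNKS = 2        # driven by the number of computational units
MIN_N_TO_SPLIT = 512  # only split FCs whose output dimension N is at least this

def chunk_dims(N, num_chunks=NUM_CHUNKS, min_n=MIN_N_TO_SPLIT):
    """Return the output width of each weight/bias chunk, or None to skip."""
    if N < min_n:
        return None                      # FC too small, leave it alone
    base, rem = divmod(N, num_chunks)
    # Distribute any remainder over the first `rem` chunks.
    return [base + (1 if i < rem else 0) for i in range(num_chunks)]

print(chunk_dims(1000))   # [500, 500]
print(chunk_dims(100))    # None: below the split threshold
```

With a helper like this, the pass would only rewrite FCs whose output dimension clears the threshold, dividing N as evenly as possible across the available computational units.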

The original idea comes from @nrsatish; please add any details/comments. cc: @bertmaher @arunm-git @beicy

nrsatish commented 5 years ago

Hi folks - I am envisioning this as not being used only in the partitioner. We may also need it even within a single function (for a single device). So what I was thinking here was a graph optimization pass where we take a single FC node and replace it with two FC nodes and a final Concat node.
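To make the shape of that rewrite concrete, here is a toy sketch; the `Node` class and `split_fc` helper below are stand-ins invented for illustration and do not reflect Glow's actual IR or API:

```python
# Toy illustration of the described pass: replace one FC node with per-chunk
# FC nodes feeding a Concat. The Node class is a made-up stand-in for Glow's IR.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "FC", "Concat", "Slice", ...
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)

def split_fc(fc, num_chunks):
    """Rewrite FC(input, W[K,N], b[N]) into Concat(FC_0, ..., FC_{n-1})."""
    inp, w, b = fc.inputs
    chunk_fcs = []
    for i in range(num_chunks):
        # Each chunk FC consumes a column slice of W and the matching bias slice.
        w_i = Node("Slice", [w], {"axis": 1, "chunk": i, "of": num_chunks})
        b_i = Node("Slice", [b], {"axis": 0, "chunk": i, "of": num_chunks})
        chunk_fcs.append(Node("FC", [inp, w_i, b_i]))
    # Concatenate the chunk outputs along the output (N) dimension.
    return Node("Concat", chunk_fcs, {"axis": 1})

original = Node("FC", [Node("Placeholder"), Node("Constant"), Node("Constant")])
rewritten = split_fc(original, num_chunks=2)
print(rewritten.kind, [n.kind for n in rewritten.inputs])  # Concat ['FC', 'FC']
```

In practice, since the FC weights and bias are constants, the column slices could presumably be materialized as smaller constants at compile time rather than kept as slice nodes.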

rdzhabarov commented 5 years ago

> I am envisioning this as not being used only in the partitioner. We may also need it even within a single function (for a single device)

That's the plan: we'll make it a standalone optimization, but the partitioner could make use of already-split FCs.

narayanan2004 commented 5 years ago

Perhaps we should tag the split nodes (not sure what mechanism we have in GLOW beyond naming the nodes in a predefined pattern) to make it easy on the partitioner to detect splits and assign them to different devices.

rdzhabarov commented 5 years ago

> Perhaps we should tag the split nodes (not sure what mechanism we have in GLOW beyond naming the nodes in a predefined pattern) to make it easy on the partitioner to detect splits and assign them to different devices.

Some mechanism like that might ease the partitioner's job a bit, but it should be trivial for the partitioner anyway: find a Concat and analyze the FCs that feed into it.
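A sketch of that detection, assuming the same duck-typed node objects (with `kind` and `inputs` attributes) as the toy IR above; this is illustrative only, not the partitioner's real API:

```python
# Given a Concat node, treat its inputs as a split-FC group if they are all
# FC nodes that share the same activation input.
def detect_split_fc_group(concat):
    if concat.kind != "Concat":
        return None
    fcs = concat.inputs
    if not fcs or any(n.kind != "FC" for n in fcs):
        return None
    shared_input = fcs[0].inputs[0]
    if any(fc.inputs[0] is not shared_input for fc in fcs):
        return None
    return fcs   # candidate FCs the partitioner may place on different devices
```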

beicy commented 5 years ago

> Perhaps we should tag the split nodes (not sure what mechanism we have in GLOW beyond naming the nodes in a predefined pattern) to make it easy on the partitioner to detect splits and assign them to different devices.

"assign the split nodes to different devices" -- is that always true?

rdzhabarov commented 5 years ago

I think we can tackle this in two steps: 1) implement the general graph optimization; 2) enhance the partitioner depending on experimentation results.

narayanan2004 commented 5 years ago

> > Perhaps we should tag the split nodes (not sure what mechanism we have in GLOW beyond naming the nodes in a predefined pattern) to make it easy on the partitioner to detect splits and assign them to different devices.
>
> "assign the split nodes to different devices" -- is that always true?

I think that is the rationale behind enabling this specific FC split heuristic (to optimize for SRAM on multiple devices). My comment was mostly about how to make the detection of these splits easier in the partitioner phase (the partitioner has the freedom to decide what to do with them), so we don't have to write any split-detection logic later on.

rdzhabarov commented 5 years ago

I'll get (1) going this week.