Open benvanik opened 1 year ago
Some models that include concats at each block:
SqueezeNet: https://tfhub.dev/tensorflow/lite-model/squeezenet/1/default/1
SpaghettiNet: https://tfhub.dev/iree/lite-model/ssd_spaghettinet_edgetpu_large_320/fp32/default/1
What data were you looking to gather? I can look into profiling them with Tracy or providing the mlir files, etc.
Hmm good question :)
Maybe we could do a regex scan over the mhlo/tosa intermediates. Since we're looking for non-0 concat dimensions, these patterns should show whether a model in intermediate form is likely to trigger these cases:
tosa\.concat.+axis = [1-9]+
mhlo\.concatenate.+dimension = [1-9]+
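That scan could look something like the following Python sketch (the helper name is made up; the regexes are the ones above, applied line by line to a dumped .mlir file):

```python
import re

# The two patterns suggested above: flag tosa/mhlo concats along a non-0 dim.
PATTERNS = [
    re.compile(r"tosa\.concat.+axis = [1-9]+"),
    re.compile(r"mhlo\.concatenate.+dimension = [1-9]+"),
]

def find_inner_dim_concats(mlir_text: str) -> list[str]:
    """Return the lines of an MLIR dump that concat along a non-0 dimension."""
    return [
        line.strip()
        for line in mlir_text.splitlines()
        if any(p.search(line) for p in PATTERNS)
    ]

example = """
%61 = "tosa.concat"(%58, %60) {axis = 3 : i64} : (...) -> ...
%62 = "tosa.concat"(%58, %60) {axis = 0 : i64} : (...) -> ...
"""
# Only the axis = 3 line should be flagged; axis = 0 is an outer-dim concat.
print(find_inner_dim_concats(example))
```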
For SpaghettiNet, we have:
%312 = "tosa.concat"(%305, %308, %311) {axis = 3 : i64} : (tensor<1x80x80x64xf32>, tensor<1x80x80x64xf32>, tensor<1x80x80x64xf32>) -> tensor<1x80x80x192xf32>
%326 = "tosa.concat"(%316, %319, %322, %325) {axis = 3 : i64} : (tensor<1x40x40x96xf32>, tensor<1x40x40x96xf32>, tensor<1x40x40x96xf32>, tensor<1x40x40x96xf32>) -> tensor<1x40x40x384xf32>
%340 = "tosa.concat"(%333, %336, %339, %330) {axis = 3 : i64} : (tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>) -> tensor<1x40x40x512xf32>
%355 = "tosa.concat"(%345, %348, %351, %354) {axis = 3 : i64} : (tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>, tensor<1x40x40x128xf32>) -> tensor<1x40x40x512xf32>
%370 = "tosa.concat"(%360, %363, %366, %369) {axis = 3 : i64} : (tensor<1x40x40x64xf32>, tensor<1x40x40x64xf32>, tensor<1x40x40x64xf32>, tensor<1x40x40x64xf32>) -> tensor<1x40x40x256xf32>
%658 = "tosa.concat"(%522, %564, %612, %638, %657) {axis = 1 : i64} : (tensor<1x1200x1x4xf32>, tensor<1x600x1x4xf32>, tensor<1x150x1x4xf32>, tensor<1x54x1x4xf32>, tensor<1x54x1x4xf32>) -> tensor<1x2058x1x4xf32>
%665 = "tosa.concat"(%527, %569, %617, %643, %664) {axis = 1 : i64} : (tensor<1x1200x91xf32>, tensor<1x600x91xf32>, tensor<1x150x91xf32>, tensor<1x54x91xf32>, tensor<1x54x91xf32>) -> tensor<1x2058x91xf32>
For SqueezeNet, we have:
%61 = "tosa.concat"(%58, %60) {axis = 3 : i64} : (tensor<1x55x55x64xf32>, tensor<1x55x55x64xf32>) -> tensor<1x55x55x128xf32>
%68 = "tosa.concat"(%65, %67) {axis = 3 : i64} : (tensor<1x55x55x64xf32>, tensor<1x55x55x64xf32>) -> tensor<1x55x55x128xf32>
%75 = "tosa.concat"(%72, %74) {axis = 3 : i64} : (tensor<1x55x55x128xf32>, tensor<1x55x55x128xf32>) -> tensor<1x55x55x256xf32>
%83 = "tosa.concat"(%80, %82) {axis = 3 : i64} : (tensor<1x27x27x128xf32>, tensor<1x27x27x128xf32>) -> tensor<1x27x27x256xf32>
%90 = "tosa.concat"(%87, %89) {axis = 3 : i64} : (tensor<1x27x27x192xf32>, tensor<1x27x27x192xf32>) -> tensor<1x27x27x384xf32>
%97 = "tosa.concat"(%94, %96) {axis = 3 : i64} : (tensor<1x27x27x192xf32>, tensor<1x27x27x192xf32>) -> tensor<1x27x27x384xf32>
%104 = "tosa.concat"(%101, %103) {axis = 3 : i64} : (tensor<1x27x27x256xf32>, tensor<1x27x27x256xf32>) -> tensor<1x27x27x512xf32>
%112 = "tosa.concat"(%109, %111) {axis = 3 : i64} : (tensor<1x13x13x256xf32>, tensor<1x13x13x256xf32>) -> tensor<1x13x13x512xf32>
Cool - that's really useful. For the first model that's likely 7 memcpys and 22 serialized, slow, emulated-memcpy dispatches instead of 0, plus the transient memory required. Probably worth about 100-300us of latency on a GPU.
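To illustrate why the axis = 3 concats above are more expensive than outer-dim ones, here is a NumPy sketch (illustrative only, not the actual IREE lowering): an outer-dim concat places each input in one contiguous span of the output, while an inner-dim concat forces one strided write per input.

```python
import numpy as np

a = np.full((1, 2, 2, 64), 1.0, dtype=np.float32)
b = np.full((1, 2, 2, 64), 2.0, dtype=np.float32)

# Outer-dim (axis=0) concat: each input occupies one contiguous span of the
# output buffer, so it is expressible as two plain memcpys.
outer = np.concatenate([a, b], axis=0)

# Inner-dim (axis=3) concat: the inputs interleave at every (n, h, w)
# position, so a naive lowering becomes one strided-write "dispatch" per
# input, serialized against the others.
inner = np.concatenate([a, b], axis=3)
out = np.empty_like(inner)
out[..., :64] = a   # strided write 1
out[..., 64:] = b   # strided write 2
assert np.array_equal(out, inner)
```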
Hi, @benvanik
"We do properly handle outer dimension concats (they get turned into flow.tensor.update ops)."
I have tried an outer-dim concat, and it does successfully turn into flow.tensor.update. The flow.tensor.update is eventually lowered to hal.command_buffer.copy_buffer, which becomes a memcpyD2D on CUDA. The explicit dispatch kernel for the concat is eliminated, but an implicit memcpyD2D kernel is still introduced.
Could we eliminate the copy_buffer? The concat sources could just reuse (write directly into) the concat destination buffer. Thanks.
my experiment case:
func.func @add_concat() {
  %lhs = util.unfoldable_constant dense<1.0> : tensor<2x2xf32>
  %rhs = util.unfoldable_constant dense<1.0> : tensor<2x2xf32>
  %lhs2 = util.unfoldable_constant dense<1.0> : tensor<3x2xf32>
  %rhs2 = util.unfoldable_constant dense<1.0> : tensor<3x2xf32>
  %lhs3 = util.unfoldable_constant dense<1.0> : tensor<5x2xf32>
  %0 = mhlo.add %lhs, %rhs : tensor<2x2xf32>
  %1 = mhlo.add %lhs2, %rhs2 : tensor<3x2xf32>
  %2 = "mhlo.concatenate"(%0, %1) {dimension = 0 : i64} : (tensor<2x2xf32>, tensor<3x2xf32>) -> tensor<5x2xf32>
  %3 = mhlo.add %lhs3, %2 : tensor<5x2xf32>
  check.expect_almost_eq_const(%3, dense<3.0> : tensor<5x2xf32>) : tensor<5x2xf32>
  return
}
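A NumPy mirror of this experiment may help show what eliminating the copy would mean (a hedged sketch, not IREE's actual lowering; `np.add(..., out=view)` stands in for a dispatch writing directly into a slice of the destination buffer):

```python
import numpy as np

lhs  = np.ones((2, 2), np.float32); rhs  = np.ones((2, 2), np.float32)
lhs2 = np.ones((3, 2), np.float32); rhs2 = np.ones((3, 2), np.float32)
lhs3 = np.ones((5, 2), np.float32)

# Today's flow.tensor.update-style lowering: compute each add into its own
# buffer, then copy (the memcpyD2D) into the concat result.
t0 = lhs + rhs           # add dispatch 1
t1 = lhs2 + rhs2         # add dispatch 2
cat = np.empty((5, 2), np.float32)
cat[:2, :] = t0          # copy 1
cat[2:, :] = t1          # copy 2

# With the copies eliminated: each add writes straight into its slice of the
# shared destination buffer, so no separate copy is issued.
fused = np.empty((5, 2), np.float32)
np.add(lhs, rhs, out=fused[:2, :])
np.add(lhs2, rhs2, out=fused[2:, :])

result = lhs3 + fused
assert np.allclose(result, 3.0)  # matches the check.expect_almost_eq_const
```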
Nice! It's possible to eliminate the copy - it's an old TODO of mine that I should finally finish: #7729 Right now we even serialize those copies so it's pretty much as bad as you can get (well, still better than making a dispatch at least). I'll take a look at it ~next week!
That's great. Looking forward to your update!
Hi, @benvanik
I have seen that the copy is eliminated for outer-dim concat. Thanks for your work.
And for inner-dim concats, instead of transposing to make them outer-dim (which requires an extra transpose kernel), have you thought about directly eliminating the concat itself? Conceptually, this requires that each source op of the concat do a strided write into the concat target buffer. To achieve this, the target buffer would have to be described as strided, and codegen would use that information to generate the memory writes. There are likely more challenges. Is it feasible?
thanks.
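The strided-write idea can be sketched in NumPy (illustrative only; here a strided view of the destination plays the role of the strided buffer description codegen would need):

```python
import numpy as np

# Destination buffer for an inner-dim (axis=-1) concat of two producers.
dest = np.empty((2, 2, 128), np.float32)

a1 = np.ones((2, 2, 64), np.float32)
a2 = np.ones((2, 2, 64), np.float32)
b1 = np.full((2, 2, 64), 2.0, np.float32)
b2 = np.full((2, 2, 64), 2.0, np.float32)

# Each producer writes through a non-contiguous, strided view of the shared
# destination, so no separate concat kernel or copy is ever issued. Codegen
# would need the same information: an output described by (base, shape,
# strides) rather than a dense buffer of its own.
np.add(a1, a2, out=dest[..., :64])   # strided view: row stride spans 128 floats
np.add(b1, b2, out=dest[..., 64:])

expected = np.concatenate([a1 + a2, b1 + b2], axis=-1)
assert np.array_equal(dest, expected)
```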
Was noticing that some concats are getting turned into very inefficient sequences of operations today (effectively gathers that get split into one dispatch per gathered input). It'd be good to survey models in the benchmark suite to see what percentage of dispatches come from concats and how they end up getting lowered. I suspect models with FFTs/complex numbers/image data may be most impacted, as that's usually where interleaving happens. We do properly handle outer dimension concats (they get turned into flow.tensor.update ops).
I wanted to get this tracked as it's something that won't stand out as a single slow dispatch in a % breakdown but will instead add to the total dispatch count and latency. Models that have high latency, low utilization, and high dispatch counts should check to see if this is a cause. It'll be worse on GPUs, where launch overhead is higher. Maybe transform dialect will solve this - it'd be worth checking how these ops lower on that path.
Example from the xla_ops/concatenate.mlir test showing this (IR snippets omitted).
Because of the way the concat is turned into dispatches, we end up doing 4 sequential operations (with full barriers between them): first filling the memory with wasted bytes and then writing over it multiple times.
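Schematically, the fill-then-overwrite pattern looks like this NumPy sketch (shapes and names are made up; slice assignment stands in for the serialized insert dispatches):

```python
import numpy as np

xs = [np.full((4,), float(i), np.float32) for i in range(3)]

# Today's lowering, schematically: one fill dispatch followed by one
# serialized insert dispatch per input, with a full barrier between each.
out = np.zeros((12,), np.float32)  # 1: fill - every byte written here is
out[0:4]  = xs[0]                  # 2: insert, barrier   # wasted, since the
out[4:8]  = xs[1]                  # 3: insert, barrier   # inserts overwrite
out[8:12] = xs[2]                  # 4: insert, barrier   # the whole buffer

assert np.array_equal(out, np.concatenate(xs))
```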
The hope would be that these inserts fuse into their producers, but they don't today (IR snippets omitted).
So that's 6 dispatches and a fill with marginal concurrency for what should really be a single dispatch. In the past I've seen models that concatenate dozens of values and for us that'd explode to hundreds of really thin memcpy-like dispatches.
This case is a good one for algorithmic improvements at the linalg level: if we added a transpose on each concatenated value and then transposed back, such that we were always doing non-interleaved concats, we could propagate/fuse everything away. That's definitely the kind of optimization we'll want to do, but it's orthogonal to handling insert slices better.
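The transpose trick can be sketched in NumPy (an illustration of the idea only, not the linalg rewrite itself): move the concat axis to the outermost position, do a contiguous outer-dim concat, then move it back.

```python
import numpy as np

a = np.ones((1, 8, 8, 16), np.float32)
b = np.full((1, 8, 8, 16), 2.0, np.float32)

# Inner-dim concat rewritten as: transpose each input so the concat axis
# becomes outermost, do a contiguous (memcpy-like) outer-dim concat, then
# transpose the result back. The transposes are what we'd hope to
# propagate/fuse away afterwards.
at = np.transpose(a, (3, 0, 1, 2))            # (16, 1, 8, 8)
bt = np.transpose(b, (3, 0, 1, 2))            # (16, 1, 8, 8)
cat_t = np.concatenate([at, bt], axis=0)      # outer-dim: two plain copies
result = np.transpose(cat_t, (1, 2, 3, 0))    # back to (1, 8, 8, 32)

assert np.array_equal(result, np.concatenate([a, b], axis=3))
```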