iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Reusable pattern for folding static dimensions into shape-aware ops. #8441

Open benvanik opened 2 years ago

benvanik commented 2 years ago

Though a majority of the cases where dimensions become static happen early in the pipeline, there are some that happen after lowering into flow and stream.tensor.* ops. It'd be really nice to be able to propagate this static information before lowering into the stream dialects.

These come from consuming tensors whose shape may be knowable:

%0 = tensor.cast %input : tensor<1x2xf32> to tensor<?x2xf32>
%dim = tensor.dim %0, %c1 : tensor<?x2xf32>
%1 = flow.tensor.clone %0 : tensor<?x2xf32>{%dim}

->

%0 = tensor.cast %input : tensor<1x2xf32> to tensor<?x2xf32>
%1 = flow.tensor.clone %0 : tensor<?x2xf32>{%c1}

And can also happen with consumers:

%dim = tensor.dim %input, %c1 : tensor<?x2xf32>
%0 = flow.tensor.clone %input : tensor<?x2xf32>{%dim}
%1 = tensor.cast %0 : tensor<?x2xf32> to tensor<1x2xf32>

I'm not sure there's anything today that replaces the dim if there's a subsequent cast, but that'd be useful:

%0 = flow.tensor.clone %input : tensor<?x2xf32>{%c1}
%1 = tensor.cast %0 : tensor<?x2xf32> to tensor<1x2xf32>

There are a few approaches here, with the most robust being to make the Util_ShapeAwareOp interface support recreating the op with new static shape dimensions. A canonicalization pattern registered on the interface could then check operands/results for cases where more static information is available and recreate the op with that. Another approach would be to expose mutable fields on the interface, but that gets messy - there are only a dozen ops and it'd be easier to just rebuild them.
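To make the shape of that rewrite concrete, here's a toy illustration in Python (all names and the IR model are hypothetical, not IREE or MLIR code): a shape-aware op carries one SSA value per dynamic dimension, and the pattern rebuilds the op whenever analysis has learned a constant for one of those values.

```python
# Toy sketch (hypothetical names, not IREE code) of the canonicalization
# idea: if any dynamic-dimension operand of a shape-aware op is actually
# a known constant (e.g. derived from a tensor.cast whose source type is
# static), recreate the op with the static size folded in.
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class ShapeAwareOp:
    name: str
    # One entry per dimension: an int if static, or a string naming the
    # SSA value that supplies the size at runtime.
    dims: tuple

def fold_static_dims(op, known_constants):
    """Rebuild `op` with any dynamic dims that map to known constants
    replaced by their static values; return `op` unchanged otherwise."""
    new_dims = tuple(
        known_constants.get(d, d) if isinstance(d, str) else d
        for d in op.dims
    )
    if new_dims == op.dims:
        return op  # nothing became static; no rewrite needed
    return dc_replace(op, dims=new_dims)

# %dim came from tensor.dim of a value cast from tensor<1x2xf32>, so the
# analysis knows %dim == 1; the clone is rebuilt as tensor<1x2xf32>.
clone = ShapeAwareOp("flow.tensor.clone", ("%dim", 2))
folded = fold_static_dims(clone, {"%dim": 1})
```

The real pattern would live on the interface itself so all dozen ops get the behavior from one implementation, which is the advantage over per-op mutable fields.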

Ideally we'd then end up with something that turned the casts into clones:

%0 = tensor.cast %input : tensor<1x2xf32> to tensor<?x2xf32>
%1 = flow.tensor.clone %0 : tensor<?x2xf32>{%c1}

->

%0 = flow.tensor.clone %input : tensor<1x2xf32>
%1 = flow.tensor.clone %0 : tensor<1x2xf32>

and

%0 = flow.tensor.clone %input : tensor<?x2xf32>{%c1}
%1 = tensor.cast %0 : tensor<?x2xf32> to tensor<1x2xf32>

->

%0 = flow.tensor.clone %input : tensor<1x2xf32>
%1 = flow.tensor.clone %0 : tensor<1x2xf32>

Since in tensor form the cast may be forking an immutable tensor value, the clone preserves the behavior (for example, if there were two cast ops consuming the same tensor); clone canonicalizers can then kick in and handle the rest of the cleanup.

Examples of where this can arise and not be caught earlier are any IPO we do (across global/function/branch boundaries), const eval that ends up introducing more static information, or specialization where we branch off paths that are, say, size 1 and size N and want to optimize the size-1 half. #8441 would also benefit as we can get more static 0's and kill more ops.

p3achyjr commented 11 months ago

Hey! If no one has taken this, could I have a go?