Closed: asraa closed this issue 1 week ago
@j2kun very kindly reminded me we can just multi-dim the tensors
(1) Flatten the tensor: e.g., if we have a tensor of polynomials, each of which is lowered to tensor<1024xi64>, then a size-2 tensor would become tensor<2048xi64>. However, this would appear to severely complicate the lowering math.
The tensor<2x1024xi64> seems more intuitive than tensor<2048xi64>.
Yep :) multi-dim it is. Haha it totally slipped my mind. I'll update with some code, I believe I need to add a type converter for ranked tensors, and then handle some tensor ops.
What's really strange is that there's no interface pattern for this :/ which makes it really weird. Even if the type converter exists, I need to add support for converting ops like tensor.from_elements - I believe some conversions write a basic catch-all op converter that just changes types for these ops.
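For concreteness, here is a hedged sketch (not from any existing pass; %p0/%p1, %c0/%c1, and #ring are placeholder names) of what such a conversion has to do: once each polynomial becomes a tensor<1024xi64>, tensor.from_elements no longer applies, and the result has to be assembled with slice insertions instead.

```mlir
// Before conversion: building a tensor of polynomials from two polys.
%t = tensor.from_elements %p0, %p1 : tensor<2x!polynomial.polynomial<#ring>>

// After type conversion, each poly (%c0, %c1) is a tensor<1024xi64>, so one
// option is to insert each converted poly as a row of an empty 2x1024 tensor:
%e  = tensor.empty() : tensor<2x1024xi64>
%r0 = tensor.insert_slice %c0 into %e[0, 0] [1, 1024] [1, 1]
        : tensor<1024xi64> into tensor<2x1024xi64>
%r1 = tensor.insert_slice %c1 into %r0[1, 0] [1, 1024] [1, 1]
        : tensor<1024xi64> into tensor<2x1024xi64>
```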
Let's chat a bit about the type conversion when you get back to this. I don't think the type conversion should be particularly hard...
At our HEIR meeting today I suggested we should be able to find a way to use upstream passes to map elementwise operations to loops, perhaps via linalg.
@AlexanderViand-Intel could you post your sample IR?
Sure, here's a minimal example; something like this also appears when lowering tests/bgv/to_polynomial.mlir:
func.func @test_bin_ops(%arg0: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, %arg1: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%0 = polynomial.add(%arg0, %arg1) : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
return %0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
Trying to lower this with -polynomial-to-standard "as-is" (as of 04f6106b42e99a04f9382c71f855929f0f3a280f) simply crashes on a cast in ConvertAdd, because the pass assumes the operands are polynomials, but here they of course aren't.
Since the whole idea of traits like ElementwiseMappable is that we shouldn't have to reinvent the wheel, I tried applying the existing -convert-elementwise-to-linalg pass to the example, which results in:
#map = affine_map<(d0) -> (d0)>
func.func @test_bin_ops(%arg0: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, %arg1: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%0 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) outs(%arg0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) {
^bb0(%in: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>, %in_0: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>, %out: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>):
%1 = polynomial.add(%in, %in_0) : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
linalg.yield %1 : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
} -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
return %0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
However, from here I'm a bit stuck: none of the passes I tried would further simplify this in a meaningful way (e.g., convert it to loops).
Note that applying -polynomial-to-standard on the linalg.generic version as-is crashes because of unresolved materializations around the yield, but adding builtin.unrealized_conversion_cast ops as materializers lets you run the pass and see what's going on a bit better:
#map = affine_map<(d0) -> (d0)>
func.func @test_bin_ops(%arg0: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, %arg1: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%0 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) outs(%arg0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) {
^bb0(%in: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>, %in_0: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>, %out: !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>):
%1 = builtin.unrealized_conversion_cast %in : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%2 = builtin.unrealized_conversion_cast %in_0 : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%cst = arith.constant dense<33538049> : tensor<1024xi26>
%3 = arith.extsi %1 : tensor<1024xi25> to tensor<1024xi26>
%4 = arith.extsi %2 : tensor<1024xi25> to tensor<1024xi26>
%5 = arith.addi %3, %4 : tensor<1024xi26>
%6 = arith.remsi %5, %cst : tensor<1024xi26>
%7 = arith.trunci %6 : tensor<1024xi26> to tensor<1024xi25>
%8 = builtin.unrealized_conversion_cast %7 : tensor<1024xi25> to !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
linalg.yield %8 : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
} -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
return %0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
From here, we could think about how these builtin.unrealized_conversion_cast ops should be resolved, and might arrive at something like this (manually adapted from the above) with multi-dimensional tensors:
#map = affine_map<(d0) -> (d0)>
func.func @test_bin_ops(%arg0: tensor<2x1024xi25>, %arg1: tensor<2x1024xi25>) -> tensor<2x1024xi25> {
%0 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<2x1024xi25>, tensor<2x1024xi25>) outs(%arg0 : tensor<2x1024xi25>) {
^bb0(%in: tensor<1024xi25>, %in_0: tensor<1024xi25>, %out: tensor<1024xi25>):
%cst = arith.constant dense<33538049> : tensor<1024xi26>
%3 = arith.extsi %in : tensor<1024xi25> to tensor<1024xi26>
%4 = arith.extsi %in_0 : tensor<1024xi25> to tensor<1024xi26>
%5 = arith.addi %3, %4 : tensor<1024xi26>
%6 = arith.remsi %5, %cst : tensor<1024xi26>
%7 = arith.trunci %6 : tensor<1024xi26> to tensor<1024xi25>
linalg.yield %7 : tensor<1024xi25>
} -> tensor<2x1024xi25>
return %0 : tensor<2x1024xi25>
}
Unfortunately, this doesn't work because linalg.generic expects to be "looping" (conceptually) over all the dimensions: error: 'linalg.yield' op type of yield operand 1 ('tensor<1024xi25>') doesn't match the element type of the enclosing linalg.generic op ('i25'). Of course, we could flatten tensor<2x1024xi25> into tensor<2048xi25>, but that seems like a lot of work/boilerplate for something pretty messy.
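As an aside, if flattening were ever needed, the reshape itself is cheap to express; a hedged sketch using the upstream tensor.collapse_shape op (%t is a placeholder value):

```mlir
// Collapse the outer and inner dims into one flat dimension; [[0, 1]] says
// result dim 0 is formed by merging source dims 0 and 1.
%flat = tensor.collapse_shape %t [[0, 1]] : tensor<2x1024xi25> into tensor<2048xi25>
```

The messy part is not the reshape but rewriting every downstream access to use flattened index arithmetic.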
I think what we want is a pass that converts elementwise-mappable ops to affine.for instead of linalg.generic, which might look a bit like this:
#map = affine_map<(d0) -> (d0)>
func.func @test_bin_ops(%arg0: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, %arg1: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%0 = affine.for %i = 0 to 2 iter_args(%a = %arg0) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%in = tensor.extract %arg0[%i] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%in_0 = tensor.extract %arg1[%i] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%1 = polynomial.add(%in, %in_0) : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
%t = tensor.insert %1 into %a[%i] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
affine.yield %t : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
return %0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
We could then handle the type conversion gracefully with multi-dimensional tensors. I haven't yet found a pass that would do this (either going through linalg.generic or directly), but I've also not spent as much time with the "standard" parts of MLIR as @j2kun and @asraa, so maybe you've got an idea here.
@AlexanderViand-Intel https://github.com/google/heir/pull/504 should show you how to convert to affine loops, though it still runs into the problem of a failed type conversion in --polynomial-to-standard, which I suspect is an unrelated bug.
Even my "manually converted" affine loop above runs into failed type conversion issues, which I think is because the type converter doesn't realize that it can convert from tensor<2x!poly.poly<...>> to tensor<2x1024xi25> (or whatever the poly in the tensor happens to convert to), so #505 might not be entirely unrelated?
I see what you're saying, but I would expect it to just type-convert the contents of the memref opaquely. Maybe the problem is that you can't have a memref of tensors? Anyhow, if you have time to root-cause that type conversion issue, that'd be helpful. Otherwise, at least we have a clear path to convert elementwise to loops now :)
I don't think it's memref related, because lowering my example above (no memrefs, just affine.for) with -polynomial-to-standard gives similar errors, and using the "augmented" version (with unrealized conversion casts) results in this:
func.func @test_bin_ops(%arg0: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>, %arg1: tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) -> tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> {
%0 = affine.for %arg2 = 0 to 2 iter_args(%arg3 = %arg0) -> (tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) {
%extracted = tensor.extract %arg0[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%1 = builtin.unrealized_conversion_cast %extracted : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%extracted_0 = tensor.extract %arg1[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%2 = builtin.unrealized_conversion_cast %extracted_0 : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%cst = arith.constant dense<33538049> : tensor<1024xi26>
%3 = arith.extsi %1 : tensor<1024xi25> to tensor<1024xi26>
%4 = arith.extsi %2 : tensor<1024xi25> to tensor<1024xi26>
%5 = arith.addi %3, %4 : tensor<1024xi26>
%6 = arith.remsi %5, %cst : tensor<1024xi26>
%7 = arith.trunci %6 : tensor<1024xi26> to tensor<1024xi25>
%8 = builtin.unrealized_conversion_cast %7 : tensor<1024xi25> to !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
%inserted = tensor.insert %8 into %arg3[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
affine.yield %inserted : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
return %0 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
If I add an explicit type conversion, then we get a bit closer (note the func has been handled properly):
func.func @test_bin_ops(%arg0: tensor<1024x2xi25>, %arg1: tensor<1024x2xi25>) -> tensor<1024x2xi25> {
%0 = builtin.unrealized_conversion_cast %arg1 : tensor<1024x2xi25> to tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%1 = builtin.unrealized_conversion_cast %arg0 : tensor<1024x2xi25> to tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%2 = affine.for %arg2 = 0 to 2 iter_args(%arg3 = %1) -> (tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>) {
%extracted = tensor.extract %1[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%4 = builtin.unrealized_conversion_cast %extracted : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%extracted_0 = tensor.extract %0[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
%5 = builtin.unrealized_conversion_cast %extracted_0 : !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>> to tensor<1024xi25>
%cst = arith.constant dense<33538049> : tensor<1024xi26>
%6 = arith.extsi %4 : tensor<1024xi25> to tensor<1024xi26>
%7 = arith.extsi %5 : tensor<1024xi25> to tensor<1024xi26>
%8 = arith.addi %6, %7 : tensor<1024xi26>
%9 = arith.remsi %8, %cst : tensor<1024xi26>
%10 = arith.trunci %9 : tensor<1024xi26> to tensor<1024xi25>
%11 = builtin.unrealized_conversion_cast %10 : tensor<1024xi25> to !polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>
%inserted = tensor.insert %11 into %arg3[%arg2] : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
affine.yield %inserted : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>>
}
%3 = builtin.unrealized_conversion_cast %2 : tensor<2x!polynomial.polynomial<<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>>> to tensor<1024x2xi25>
return %3 : tensor<1024x2xi25>
}
but it's still not quite there, because the affine.for isn't converted. I think this is because it's not part of the operations that addStructuralConversionPatterns handles. Do you know if there's an equivalent for affine, or do we need to manually mark it as a dynamically legal op?
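For reference, a hand-written sketch (my guess at the target, not the output of any pass) of what the structurally converted loop would need to look like once the iter_args and result types are rewritten and rank-reducing slices replace the extract/insert:

```mlir
%0 = affine.for %i = 0 to 2 iter_args(%acc = %arg0) -> (tensor<2x1024xi25>) {
  // A rank-reducing slice takes the place of tensor.extract on the poly tensor.
  %row = tensor.extract_slice %arg0[%i, 0] [1, 1024] [1, 1]
           : tensor<2x1024xi25> to tensor<1024xi25>
  // ... lowered polynomial.add body on %row, as in the IR above ...
  %res = tensor.insert_slice %row into %acc[%i, 0] [1, 1024] [1, 1]
           : tensor<1024xi25> into tensor<2x1024xi25>
  affine.yield %res : tensor<2x1024xi25>
}
```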
I think I have an idea of how to handle this (and also #505), WIP here. It nearly works, but right now only with --verify-each=false, because it produces invalid tensor.extract and affine.load operations.
The reason is that when we go from extracting a poly from a tensor<2x!polynomial.polynomial<..>> to trying the same on a tensor<2x1024xi25>, we need to switch from tensor.extract to tensor.extract_slice (and ditto from affine.load to affine.vector_load). I'll put together a PR later today.
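To illustrate the memref side of that switch, a minimal hedged sketch (%m and %i are placeholder names; the op is upstream affine.vector_load):

```mlir
// A scalar affine.load no longer works once the "element" is itself 1024
// coefficients wide; a vector load of the whole row is needed instead:
%v = affine.vector_load %m[%i, 0] : memref<2x1024xi25>, vector<1024xi25>
```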
PS: this seems like it would pop up nearly as frequently as the func.func/control-flow conversion needs, so I wouldn't be surprised if there's already something like addStructuralConversionPatterns but for containers - I just couldn't find one when I looked around the upstream examples.
PPS: 🤦 addStructuralConversionPatterns is a HEIR thing, not an MLIR thing, so no wonder there's no equivalent (yet!)
The "tensor of tensor" PR (#508) lays the groundwork for supporting poly operations on tensors, but it's not a full end-to-end solution because it doesn't support the code generated by the --convert-elementwise-to-linalg chain (c.f. #504), so even with that we still can't quite lower a polynomial.add %t1, %t2 : tensor<2x!polynomial.polynomial>.
Instead of adding "memref of memref" helpers and then trying to figure out how to elegantly handle the raising from memref back to tensor (c.f. #505), I figured it might actually be easier to just add a --convert-elementwise-to-affine pass, which seems like a generally useful thing to have, so I started a draft PR for that: #524
Reopening just to clean up the remaining comments
Digging this back up, as I noticed (as part of starting on #559) that poly.ntt and poly.intt aren't ElementwiseMappable and, in fact, can't be, because ElementwiseMappable requires that all non-scalar types match, and a "tensorized" version of an NTT might look something like this:
#ring = #polynomial.ring<cmod=33538049, ideal=#polynomial.polynomial<1 + x**1024>>
%0 = polynomial.ntt %some_poly : !polynomial.polynomial<#ring> -> tensor<1024xi64, #ring> // normal NTT
%1 = polynomial.ntt %some_poly_tensor : tensor<2x!polynomial.polynomial<#ring>> -> tensor<2x1024xi64, #ring> // tensor NTT
and so the ElementwiseMappable verifier will complain that "all non-scalar operands/results must have the same shape and base type". In the future, it might be interesting to start a conversation upstream about the possibility of expanding ElementwiseMappable, but for now I'll just avoid adding the trait to these ops and instead add a TODO for custom patterns for poly.ntt and poly.intt in our elementwise-to-affine pass.
I'm re-opening this issue, as the problem mentioned above appeared again as part of #763 and therefore needs attention.
@mr0re1 brought up a great review comment about lowering tensors of polys to standard. https://github.com/google/heir/pull/134/files#r1312057016
Tensors of polys arise from BGV lowerings (and likely all other scheme lowerings to poly). When lowering to standard, we convert each poly to a tensor of ints.
Tensors of tensors are disallowed in MLIR - they are invalid tensor element types.
We have a few options here:
(1) Flatten the tensor: e.g., if we have a tensor of polynomials, each of which is lowered to tensor<1024xi64>, then a size-2 tensor would become tensor<2048xi64>. However, this would appear to severely complicate the lowering math.
(2) Detensorize/lower to linalg and then lower to standard? E.g., inputs to PolyToStandard would not include tensors.
(3) It feels like I'm missing something - surely type conversion with tensors has been done before?
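To make the trade-off in (1) concrete, a small hedged sketch (variable names invented) of what accessing coefficient %j of polynomial %i looks like in the two layouts:

```mlir
// With tensor<2x1024xi64>, the access is just a 2-D index:
%a = tensor.extract %md[%i, %j] : tensor<2x1024xi64>

// With a flattened tensor<2048xi64>, every access needs i * 1024 + j:
%c1024 = arith.constant 1024 : index
%off   = arith.muli %i, %c1024 : index
%idx   = arith.addi %off, %j : index
%b     = tensor.extract %flat[%idx] : tensor<2048xi64>
```

The extra index arithmetic would have to be threaded through every lowering pattern, which is the "complicated lowering math" referred to above.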