tensorflow / mlir

"Multi-Level Intermediate Representation" Compiler Infrastructure

llvm.noalias attribute on Memref arguments is not properly lowered #309

Open dcaballe opened 4 years ago

dcaballe commented 4 years ago

We've been using the 'llvm.noalias' attribute on MemRefs to propagate no-alias information about function arguments from nGraph to LLVM. This enabled some LLVM optimizations like vectorization and prevented unnecessary runtime checks on pointers, which were crucial for performance. Unfortunately, 'llvm.noalias' is not properly lowered after the latest changes to the MemRef type, and we are observing significant performance drops because of this.

Initially, the MemRef descriptor in LLVM was just a pointer to the allocated data for that MemRef, so 'llvm.noalias' was propagated to that pointer. However, the MemRef descriptor in LLVM is currently a struct as follows:

// template <typename Elem, size_t Rank>
// struct {
//   Elem *allocatedPtr;
//   Elem *alignedPtr;
//   int64_t offset;
//   int64_t sizes[Rank]; // omitted when rank == 0
//   int64_t strides[Rank]; // omitted when rank == 0
// };

and 'llvm.noalias' is propagated to the pointer to the MemRef descriptor (the struct) instead of to the actual data pointers (allocatedPtr and alignedPtr).

Example:

func @main(%arg0 : memref<16xf32> {llvm.noalias = true}) {
  return
}
mlir-opt noalias.mlir -convert-std-to-llvm
module {
  llvm.func @main(%arg0: !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }*"> {llvm.noalias = true}) {
    %0 = llvm.load %arg0 : !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }*">
    llvm.return
  }
}
mlir-opt noalias.mlir -convert-std-to-llvm | mlir-translate --mlir-to-llvmir
define void @main({ float*, float*, i64, [1 x i64], [1 x i64] }* noalias %0) {
  %2 = load { float*, float*, i64, [1 x i64], [1 x i64] }, { float*, float*, i64, [1 x i64], [1 x i64] }* %0
  ret void
}

We would need the no-alias information to be propagated to allocatedPtr/alignedPtr in the MemRef descriptor. I'm not sure what the best way to do that would be, since noalias is a function argument attribute that cannot be applied to struct fields, and allocatedPtr and alignedPtr do alias now. We would probably have to use alias scopes, but I'm not very familiar with that. Any thoughts?
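
To illustrate what I mean, here is a hand-written sketch of roughly the LLVM IR we would want to end up with (hypothetical metadata numbering, and with a data access added since the example above has an empty body): the no-alias fact is expressed as scoped metadata on accesses through the data pointers rather than as an attribute on the descriptor pointer.

define void @main({ float*, float*, i64, [1 x i64], [1 x i64] }* %0) {
  %desc = load { float*, float*, i64, [1 x i64], [1 x i64] }, { float*, float*, i64, [1 x i64], [1 x i64] }* %0
  %aligned = extractvalue { float*, float*, i64, [1 x i64], [1 x i64] } %desc, 1
  ; accesses through other memref arguments would carry a !noalias list containing !1
  %v = load float, float* %aligned, align 4, !alias.scope !2
  ret void
}

!0 = distinct !{!0}     ; alias scope domain for @main
!1 = distinct !{!1, !0} ; scope for the data of %arg0
!2 = !{!1}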

Thanks, Diego

bondhugula commented 4 years ago

I think this is a crucial issue - thanks for bringing this up. I wish you had actually added a test case upstream in test/Target/llvmir.mlir exercising propagation of llvm.noalias as you show above --- the issue would have been exposed earlier.

Using the llvm noalias function parameter attribute was really a simple solution to get away with. :) This now requires a systematic solution, which I think was inevitable anyway. I think alias.scope metadata will now be needed on the generated loads/stores, with self-referencing domain scopes. The name and meaning of llvm.noalias could be changed to convey propagation via indirection: llvm.noalias -> llvm.nomemrefalias as the first step. I tend to think the desired goal is easier/cleaner to accomplish in a pass on the LLVM dialect: the pass would process the nomemrefalias attribute and mark loads/stores with alias.scope metadata, discarding the nomemrefalias attribute at the end. OTOH, trying to do this during the std-to-llvm dialect lowering would require introducing state to consistently/uniquely assign scope domain IDs during store/load op lowering by tracking the origin of each memref descriptor value.

~ Uday

dcaballe commented 4 years ago

Thanks, Uday! Yes, it was a temporary (but very convenient) solution :). I agree it's very unfortunate that we didn't have test coverage for that in MLIR. This has become a blocker in terms of performance for us.

Your proposed approach sounds reasonable to me. I was wondering if we currently have support to represent metadata in LLVM-IR dialect in some way. Otherwise, translating "markers" (attributes?) on LLVM-IR dialect loads/stores to proper alias domains using metadata in LLVM-IR might not be as straightforward a 1:1 mapping as the rest of the translation. Any thoughts on this?

I'm going to be OOO for a couple of days. I could give this a thought when I'm back if nobody has started already. However, I think I'll need some help.

Thanks, Diego

bondhugula commented 4 years ago

Your proposed approach sounds reasonable to me. I was wondering if we currently have support to represent metadata in LLVM-IR dialect in some way. Otherwise, translating "markers" (attributes?) on LLVM-IR dialect loads/stores to proper alias domains using metadata in LLVM-IR might not be as straightforward a 1:1 mapping as the rest of the translation. Any thoughts on this?

I don't think we have examples of representing metadata like LLVM scope domains in MLIR. Also, just translating op attributes to LLVM instruction metadata won't be sufficient here, since the alias metadata has out of line definitions, as shown below, which we may want to print in a similar way in the LLVM dialect.

; scope domain
!0 = !{!0}
!1 = !{!1}

; scope list
!2 = !{!0, !1}

%0 = load float, float* %c, align 4, !alias.scope !2

In fact, if we want something that looks close to the printed LLVM IR (out of line definitions for the domains), a possible approach with MLIR would be to use 0-d constant affine maps. AFAIK, they are the only thing (besides integer sets) which have out of line definitions/IR printing. All the infra needed is also already in the IR. 0-d constant affine maps could be used to define scoped domains, and the maps could be referred to in attributes on the loads/stores.

// alias scope domain
#0 = () -> (0)

%0 = llvm.load ... {alias_domain = #0} : 

This would look nearly isomorphic to the resulting LLVM IR. To define a list of domains:

#2 = () -> (0, 1)

The key issue with using affine maps in the LLVM dialect would appear to be that it goes against a future plan to move the affine expr/map infrastructure into the affine dialect to make core MLIR leaner. But one could argue that a use case like this is a motivation to just keep it in IR the way it is; otherwise, we'd anyway need another structure and the associated infra to model such "global" metadata and it could be representationally less powerful if it's just designed for the "constant integers defining domains" case. It'd be good to consider now what sort of other LLVM metadata/attributes we aren't able to model well atm.

The above still doesn't address the 1:1 mapping question you have, but it allows use outside the LLVM dialect as well (since the alias domain information is likely populated in higher level dialects). The other approach could be to just use a list of integers as an attribute instead of a map. But you won't get out of line printing from the IR. In either case, mlir-translate would have to generate the domains based on alias_domain attributes.

@joker-eph and @ftynse will have useful feedback on this.

I'm going to be OOO for a couple of days. I could give this a thought when I'm back if nobody has started already. However, I think I'll need some help.

I can help with this.

~ Uday

nicolasvasilache commented 4 years ago

Hi Diego,

Here are a few of the tradeoffs involved here:

  1. the current solution (passing pointers to normalized structs allocated locally) is really a temporary kludge, because there is currently no good solution for laying out data properly in MLIR to interop with externally linked libraries. Others have pointed out that this is not a trivial problem and that doing it properly would involve taking sizeable chunks of code from clang and doing target-specific codegen with generic support for LLVM's TargetTriple. We welcome contributions on this front as no one on our side is currently working on this.
  2. it is possible to flatten out the internals of the struct and avoid point 1. You could add a flag when lowering to LLVM and implement this. The discussion in the strided memref RFC spells out why this cannot be the default behavior and why we are open to evolution based on concrete perf data (which you have).
  3. note however that the above is still very specialized. We are going to start experimenting with descriptors for richer data types soon(ish), and point 2 is not a solution outside of dense memrefs.

Bottom line, I think you guys properly identified a reasonable solution that is general enough and IMO it is the right path forward.

Irrespective of making aliasing work on the current struct descriptors, it is also reasonable to implement a working solution, controlled by a flag, for point 2 for the following cases:

  1. fully specialized memref type (i.e. naked pointer)
  2. fully dynamic memref type (i.e. flattening the whole descriptor as a list of arguments)

However, I think the previous approach (i.e. partial specializations, passing only exactly the dynamic sizes and strides) is a premature optimization with a high code-complexity cost. I would still avoid it in the absence of strong supporting performance data.
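
For reference, here is a hedged sketch of what case 1 above (the naked-pointer lowering) could look like. This is hypothetical output of such a flag-controlled lowering, not what -convert-std-to-llvm produces today; the point is that the existing llvm.noalias argument attribute would then land directly on the data pointer:

llvm.func @main(%arg0: !llvm<"float*"> {llvm.noalias = true}) {
  llvm.return
}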

ftynse commented 4 years ago

There was another person on the mailing list interested in memref lowering schemes other than our current descriptors, so I think providing an easy way to plug into the std->llvm lowering is a reasonable short-term solution. This will allow one to implement point 2 from Nicolas's comment above as a lowering option.

Note that the dialect conversion framework actually supports one-to-many rewrites of function arguments, so it is technically possible to unpack the descriptor into individual arguments, and potentially rewrite the aliasing allocated+aligned pointer pair into a non-aliasing allocated pointer plus an integer alignment offset. However, using multiple arguments to a function quickly gets into ABI territory if you want to interact with C code.
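
To make that concrete, here is a hand-written sketch (the argument layout is hypothetical) of what such an unpacked signature could look like for a memref<?xf32> argument, with the aliasing allocated/aligned pointer pair replaced by a single pointer plus an integer alignment offset:

llvm.func @foo(%data: !llvm<"float*"> {llvm.noalias = true}, %align_offset: !llvm<"i64">,
               %offset: !llvm<"i64">, %size0: !llvm<"i64">, %stride0: !llvm<"i64">) {
  llvm.return
}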

The longer-term solution would be to introduce an alias model into MLIR proper, and I would be happy to see an RFC going in that direction.

I was wondering if we currently have support to represent metadata in LLVM-IR dialect in some way.

There's no direct support for LLVM metadata, but using attributes to model individual pieces of metadata on a need-driven basis is the way to go IMO.

AFAIK, they are the only thing (besides integer sets) which have out of line definitions/IR printing.

Any attributes are allowed in attribute aliases (https://github.com/tensorflow/mlir/blob/master/g3doc/LangRef.md#attribute-value-aliases); in particular, array and dictionary attributes work perfectly fine and are exercised by linalg/structured ops. Something like

#scope0 = [0,1]
#scope1 = [2,3]
#scope3 = [#scope0, #scope1]

llvm.func @foo() {
  // ...
  llvm.load %0 {llvm.alias_scope = #scope3} : !llvm<"float*">
}

could work just fine. The problem with this approach (and with affine maps) is that it could create arbitrarily nested structures, potentially with repeating values inside, rather than becoming a flattened integer set. If we were to model this properly, I would suggest introducing a separate attribute type that has set-like behavior with unpacking on top of a list. There was a discussion on the mailing list about introducing leading keywords in attribute syntax to disambiguate between affine maps and function types, so something like #scope = flatset {0, 1} would make sense eventually.

dcaballe commented 4 years ago

Thank you all for the feedback! I think we all agree that introducing an alias model is the long-term solution, regardless of this particular issue. Let me investigate a bit more how difficult introducing custom memref lowering schemes would be.

ftynse commented 4 years ago

https://github.com/tensorflow/mlir/commit/291a309d7164d9d38a8cc11cfdf88ca2ff709da0 implemented a separation between memory-related conversion patterns and the remaining patterns. From there, it should suffice to derive LLVMTypeConverter, override its convertType() to lower MemRefType differently while deferring to the base class for the other types, and provide the lowering patterns for std operations listed in populateStdToLLVMMemoryConversionPattern. This should unblock you in the short term by doing local changes in your pipeline. We remain open to discussing upstream solutions.
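
A minimal sketch of that plumbing, assuming the API as of this thread (the bare-pointer conversion rule here is only a hypothetical placeholder and ignores offsets, strides and dynamic shapes):

#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

// Derive the converter and lower MemRefType differently, deferring to the base
// class for all other types.
class BarePtrTypeConverter : public mlir::LLVMTypeConverter {
public:
  using mlir::LLVMTypeConverter::LLVMTypeConverter;

  mlir::Type convertType(mlir::Type type) override {
    if (auto memrefTy = type.dyn_cast<mlir::MemRefType>()) {
      // Hypothetical rule: memref<...xT> -> T* in the memref's address space.
      auto elemTy = convertType(memrefTy.getElementType())
                        .dyn_cast_or_null<mlir::LLVM::LLVMType>();
      if (!elemTy)
        return {};
      return elemTy.getPointerTo(memrefTy.getMemorySpace());
    }
    return mlir::LLVMTypeConverter::convertType(type);
  }
};

The memory-related std ops would then need matching patterns that understand this bare-pointer form instead of the struct descriptor.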

joker-eph commented 4 years ago

In fact, if we want something that looks close to the printed LLVM IR (out of line definitions for the domains), a possible approach with MLIR would be to use 0-d constant affine maps. [...]

Using affine maps to represent a list of integers seems like quite a big hammer to me; it isn't clear why we would do this instead of using attributes, for example. But beyond this: in LLVM, the aliasing scopes and domains aren't "integers", they are just unique abstract entities; you don't manipulate a domain with a number, you just create a metadata node and the infrastructure keeps it unique. The integers in the textual print are just the standard metadata printing in LLVM (much like SSA value numbers don't exist in the in-memory IR and are just an artifact of serialization).

Also, we have a way to represent "global" (module level, really) information: operations. I don't know what the best representation for this is, but we can explore a few things.

-- Mehdi

dcaballe commented 4 years ago

I implemented my own LLVMTypeConverter with a custom lowering for MemRef type. However, that was not enough since Std-to-LLVM conversion uses MemRefDescriptor class all over the place, which assumes that MemRef type is lowered to the struct descriptor and generates the corresponding insertion and extraction operations to/from the struct.

To overcome this issue, I introduced some changes to be able to provide a custom MemRefDescriptor as well (https://github.com/tensorflow/mlir/pull/337). This PR also introduces a test with a custom MemRef descriptor lowering that lowers MemRef types to plain pointers. This seems to work nicely. However, it's not enough to work around the aliasing problem. Currently, due to ABI requirements, convertFunctionSignature and FuncOpConversion introduce an extra level of indirection on MemRef function arguments, so the llvm.noalias attribute ends up on the first pointer:

func @check_noalias(%arg0: !llvm<"float**"> {llvm.noalias = true}) {

However, I found a simple workaround for now: I locally overrode convertFunctionSignature and implemented my own FuncOpConversion that avoids the extra indirection. This temporarily unblocks me, although I'm not very happy with the solution. I think we should introduce a mechanism to deal with custom ABIs.
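
With that local override in place, the signature comes out as needed, presumably something like the following (hand-written sketch rather than actual output):

llvm.func @check_noalias(%arg0: !llvm<"float*"> {llvm.noalias = true}) {
  llvm.return
}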

What do you think?

PS: Commit https://github.com/tensorflow/mlir/commit/291a309d7164d9d38a8cc11cfdf88ca2ff709da0 doesn't seem to be needed atm.

ftynse commented 4 years ago

However, that was not enough since Std-to-LLVM conversion uses MemRefDescriptor class all over the place, which assumes that MemRef type is lowered to the struct descriptor and generates the corresponding insertion and extraction operations to/from the struct.

That's why I separated the memref-related patterns into a separate populate* function. My suggestion was to

provide the lowering patterns for std operations listed in populateStdToLLVMMemoryConversionPattern

that is, implement new patterns that are aware of a different lowering scheme and use them instead of the existing ones. A different lowering pattern indeed removes your problem with ABI requirements in the function signature. I think using different patterns for different lowering schemes is the right solution, achieving better separation of concerns. It is also significantly easier to implement than custom ABI support. Why are you not happy with it?

I did not read #337 in detail, but it goes in the wrong direction IMO. It essentially makes the current descriptor more standard and requires any other implementation to comply with its API. There may be cases where the offset field, or the differentiation between allocated and aligned pointers, doesn't make sense. (Bare pointers is one of them, although you might get away with it; using more complex layouts than sizes+strides will be impossible.) So why insist on reusing the existing patterns through an extra layer of indirection when defining new patterns, which would also produce cleaner IR that is aware of the actual descriptor, is relatively cheap?

dcaballe commented 4 years ago

Sorry, I saw this comment after replying to the PR.

Why are you not happy with it?

Basically, what I said in the PR. Of course we can always create our own LLVM lowering but that has a high maintainability and code duplication cost. A custom LLVM lowering won't have control over the Std ops reaching the lowering, so we would have to keep catching up with upstream every time something changes in memref-related ops in the Std dialect or new ops are added.

It essentially makes the current descriptor more standard and requires any other implementation to comply with its API.

That's because I didn't want to change the existing API to minimize changes. I hoped the API could be turned into something more generic in subsequent patches.

Maybe a good trade-off could be hosting upstream these alternative LLVM patterns to lower MemRef to plain pointers. If I understood correctly, there are more people interested in this. That would reduce the maintainability burden and maybe some code reuse across lowerings is possible. WDYT?

ftynse commented 4 years ago

Basically, what I said in the PR. Of course we can always create our own LLVM lowering but that has a high maintainability and code duplication cost.

LLVM lowering isn't a monolithic pass and it shouldn't be, so I would argue you are not creating an entirely new lowering, just changing the part that you want to be handled differently. As for the duplication cost, given how different is the lowering scheme, how much code would be actually duplicated? Most of the code in the current patterns is handling dynamic sizes and strides and it's essentially dead if you don't have them. If there is something reusable indeed, that can be factored out into helper functions or something like pattern factories.

A custom LLVM lowering won't have control over the Std ops reaching the lowering

It should be possible to return matchFailure() from the pattern. Actually, by having a custom pattern you get the possibility of accepting different operations than the current lowering does. By patching into the current lowering you must support everything it does and cannot support anything else.
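
For instance, here is a hedged sketch of such a pattern for std.load in a bare-pointer scheme; the class name and the shape restriction are made up for illustration, the API is as of the tree at the time, and the actual address computation/rewrite is elided:

#include "mlir/Dialect/StandardOps/Ops.h"
#include "mlir/Transforms/DialectConversion.h"

// A custom std.load lowering for a bare-pointer scheme. It declines memrefs it
// does not support, leaving them to whatever other pattern is registered.
struct BarePtrLoadLowering : public mlir::ConversionPattern {
  BarePtrLoadLowering(mlir::MLIRContext *ctx)
      : ConversionPattern(mlir::LoadOp::getOperationName(), /*benefit=*/1, ctx) {}

  mlir::PatternMatchResult
  matchAndRewrite(mlir::Operation *op, llvm::ArrayRef<mlir::Value *> operands,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    auto loadOp = llvm::cast<mlir::LoadOp>(op);
    // Only statically shaped memrefs fit this lowering; anything else is
    // rejected here rather than silently mishandled.
    if (!loadOp.getMemRefType().hasStaticShape())
      return matchFailure();
    // ... compute the address from the bare data pointer in operands[0] and
    // replace the op with an llvm.load before returning ...
    return matchSuccess();
  }
};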

That's because I didn't want to change the existing API to minimize changes. I hoped the API could be turned into something more generic in subsequent patches.

If we push the "being generic" part a couple of steps further (e.g., not relying on the specific descriptor API, on having allocated/aligned pointers, etc.), we will essentially get a default pattern along the lines of

// load/store lowering
BaseDescriptor d = lowering.convertMemRefToOpaqueDescriptor(memref.getType());
Value *p = lowering.computeActualAddressAndGetPointerFromDescriptor(d, indices);
rewriter.replaceOpWithNewOp<LLVM::LoadOp>(op, p);

at which point, having any sort of dispatch to actual implementations from the "default" pattern looks like overhead compared to just having different patterns.

Maybe a good trade-off could be hosting upstream these alternative LLVM patterns to lower MemRef to plain pointers. If I understood correctly, there are more people interested in this. That would reduce the maintainability burden and maybe some code reuse across lowerings is possible. WDYT?

This would shift the burden to whoever updates the core, not necessarily reduce it. But I understand the concern about getting out of sync. So I'm fine with upstreaming an additional lowering for memrefs that goes to bare pointers when possible and fails otherwise. IMO, this can still be implemented as separate patterns producing simple code, sharing implementation helpers where reasonable. The choice between lowerings can be controlled by an option in the pass, similarly to what we do for malloc/alloca.