Tracking ops may fail if a pattern does not call replace.

gysit commented 2 years ago

@Mogball @nicolasvasilache

There is an interesting "issue" after a recent LLVM change (https://github.com/llvm/llvm-project/commit/589eac6524d6ba080a51107757c1f356c365d047). The pattern replaces a LinalgOp without calling the replaceOp method. As a result, the listeners do not observe the replacement and the "payload op" to "transformed op" mapping gets out of sync.

The problem can be reproduced by running: build/bin/mlir-proto-opt test/LinalgTransform/double-tiling.mlir -linalg-interp-transforms The interpreter generates only three instead of six tile loops.

A way to fix this issue is to ensure the pattern calls the replaceOp (https://reviews.llvm.org/D121369). I wonder if there is a solution to make the tracking infrastructure more robust to cover such cases? I think the pattern does what it is supposed to do. It matches the cast op and properly calls replaceOp for the cast operation. Adapting patterns to make the replace explicit still seems like the way to go...

ftynse commented 2 years ago

It looks challenging to extend the tracking to support this specific case, although I may need to think further about it. Maybe we can also intercept replacements of tensor casts and walk up and down their use-def chains to see if we hit some tracked Linalg op that is affected, similarly to how we can see through casts when replacing a Linalg op with a cast.

The FoldTensorCastConsumerOp pattern (don't be confused by the name, it is a pattern and not a folder) may or may not replace the linalg op. There is a control flow path where the op seems to be only cloned. I suppose the old op is expected to be DCE'd, at which point we would need to somehow match that the op being removed by DCE was previously "replaced" with something else.

The safest longer-term alternative IMO is to move towards "functional patterns" that we discussed and started introducing for some transformations. Such patterns return the updated "core" operation instead of just success/failure status so it is easy for external observers to update their state. This is not a foolproof solution though, some very generic canonicalization patterns may still find hacky ways of replacing tracked ops without adopting the "functional pattern" style. We can only assert aggressively against this and deal with the consequences when assertions fire, hoping that we have enough testing to deter whoever introducing such indiscriminate patterns from pushing them on everyone.

gysit commented 2 years ago

tensor casts and walk up and down their use-def chains to see if we hit some tracked Linalg op that is affected

Right that may work. When observing the replacement of an operation we may also consider deleted (DCE) and inserted adjacent operations and could then generate a replacement notification for them. In the example pattern, one LinalgOp is deleted and one LinalgOp is inserted so it can be seen as part of the CastOp replacement... It is less clear though what should happen if multiple LinalgOps are generated... We may just add replacement information for all of them and then let transformations filter the interesting ones.

I suppose the old op is expected to be DCE'd, at which point we would need to somehow match that the op being removed by DCE was previously "replaced" with something else.

After looking at the FoldTensorCastConsumerOp pattern for some more time, I actually believe there is a bug if the linalgOp has multiple results. In that case, both linalgOp and newOp will remain in the code. Writing the pattern with replaceOp actually fixes the problem (https://reviews.llvm.org/D121369) if I am not mistaken.

The safest longer-term alternative IMO is to move towards "functional patterns"

I agree and it could well be that short-term we should just fix patterns that do not use replaceOp properly.

PS Would it make sense to use Location information for the operation tracking. It may not help with this problem but it may help to debug or serialize the transformation history if an op is associated to its source code location but also to the location of the transform ops that have been applied.

ftynse commented 2 years ago

When observing the replacement of an operation we may also consider deleted (DCE) and inserted adjacent operations and could then generate a replacement notification for them. In the example pattern, one LinalgOp is deleted and one LinalgOp is inserted so it can be seen as part of the CastOp replacement...

Adjacency is a weak relation here, I wouldn't build on that. We could indeed try to "match" a deletion and an insertion as a replacement, but I am wary of the tracking mechanism becoming excessively complex, the same way as dialect conversion that is no longer fully understood :) My thinking is that ultimately there is an operand being replaced by another value in the IR, so it should be possible to trace back from there. It's just that we don't have a hook for operand replacements.

I agree and it could well be that short-term we should just fix patterns that do not use replaceOp properly.

+1.

PS Would it make sense to use Location information for the operation tracking. It may not help with this problem but it may help to debug or serialize the transformation history if an op is associated to its source code location but also to the location of the transform ops that have been applied.

Most patterns just mindlessly copy the location info of the "main" matched op to all the new ops, so I expect this to be difficult. Some time ago (pre-Discourse, I think), I proposed to optionally append information about the patterns applied to the location info for debugging purposes. That is, the location info would contain the original location + "converted by PatternName in filename.cc:123". This was considered too expensive because this information is going to be stored in the context forever.

gysit commented 2 years ago

I am wary of the tracking mechanism becoming excessively complex

I agree matching adjacent operations is complicated and hard to follow for a user.

I proposed to optionally append information about the patterns applied to the location info for debugging purposes. That is, the location info would contain the original location + "converted by PatternName in filename.cc:123". This was considered too expensive because this information is going to be stored in the context forever.

Ah interesting! I was thinking about it from user perspective but I can see that it would add a significant amount of extra data that needs to be carried around...

iree-org / iree-llvm-sandbox

Tracking ops may fail if a pattern does not call replace. #351