If the unpack and pack ops have the same attributes and there are no padding values, we can simply fold the pair away: every use of the pack result is replaced by the source of the unpack op.
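As a rough illustration of why the fold is safe, here is a minimal pure-Python sketch. It assumes 2-D data, an identity outer permutation, and tile sizes that divide the shape evenly (so no padding value is involved). The helper names `pack` and `unpack` are hypothetical stand-ins for the ops, not the compiler's API.

```python
def pack(src, tile):
    """Tile a 2-D list into [M/t0][N/t1][t0][t1]; assumes no padding is needed."""
    t0, t1 = tile
    rows, cols = len(src), len(src[0])
    assert rows % t0 == 0 and cols % t1 == 0, "padding would be required"
    return [[[[src[i * t0 + a][j * t1 + b] for b in range(t1)]
              for a in range(t0)]
             for j in range(cols // t1)]
            for i in range(rows // t0)]


def unpack(packed):
    """Inverse of pack: flatten [O0][O1][t0][t1] back to a 2-D list."""
    o0, o1 = len(packed), len(packed[0])
    t0, t1 = len(packed[0][0]), len(packed[0][0][0])
    return [[packed[m // t0][n // t1][m % t0][n % t1]
             for n in range(o1 * t1)]
            for m in range(o0 * t0)]


# A 4x6 matrix packed with 2x3 tiles.
matrix = [[r * 6 + c for c in range(6)] for r in range(4)]
packed = pack(matrix, (2, 3))

# unpack followed by pack with the same tile sizes reproduces the unpack
# source, so the pair can be replaced by `packed` directly.
repacked = pack(unpack(packed), (2, 3))
assert repacked == packed
```

With different tile sizes the round trip no longer returns the unpack source, which is the case the next point covers.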
If they have different attributes and there are no padding values, we should be able to decompose the pair into transpose + reshape + transpose ops. The transpose ops can be fused with their producers and consumers; the reshape ops are metadata-only ops.
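The decomposition can be sketched in pure Python on a row-major flat buffer. This is an illustration under stated assumptions, not the compiler's implementation: 2-D data, identity outer permutations, no padding, and hypothetical helper names. Note that the reshape step only changes the shape list and never touches the buffer, which is what makes it a metadata op.

```python
from itertools import product


def transpose(data, shape, perm):
    """Permute the axes of a row-major flat buffer; returns (data, shape)."""
    new_shape = [shape[p] for p in perm]
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    out = []
    for idx in product(*(range(n) for n in new_shape)):
        # idx[k] is the index along source axis perm[k].
        out.append(data[sum(idx[k] * strides[perm[k]]
                            for k in range(len(perm)))])
    return out, new_shape


def unpack_then_pack_decomposed(data, outer, tile, new_tile):
    """unpack(tile) + pack(new_tile) as transpose + reshapes + transpose."""
    (o0, o1), (t0, t1), (s0, s1) = outer, tile, new_tile
    m, n = o0 * t0, o1 * t1
    assert m % s0 == 0 and n % s1 == 0, "padding would be required"
    # transpose: [o0, o1, t0, t1] -> [o0, t0, o1, t1]
    data, _ = transpose(data, [o0, o1, t0, t1], [0, 2, 1, 3])
    # reshapes: collapse to [m, n], then expand to [m/s0, s0, n/s1, s1].
    # Only the shape changes; the buffer is left as is.
    shape = [m // s0, s0, n // s1, s1]
    # transpose: [m/s0, s0, n/s1, s1] -> [m/s0, n/s1, s0, s1]
    return transpose(data, shape, [0, 2, 1, 3])


def unpack_then_pack_reference(data, outer, tile, new_tile):
    """Direct unpack followed by pack, used to check the decomposition."""
    (o0, o1), (t0, t1), (s0, s1) = outer, tile, new_tile
    m, n = o0 * t0, o1 * t1
    mat = [[data[((r // t0) * o1 + c // t1) * t0 * t1 + (r % t0) * t1 + c % t1]
            for c in range(n)] for r in range(m)]
    out = [mat[i * s0 + a][j * s1 + b]
           for i in range(m // s0) for j in range(n // s1)
           for a in range(s0) for b in range(s1)]
    return out, [m // s0, n // s1, s0, s1]


# 4x6 logical matrix: unpack from 2x3 tiles, then pack into 4x2 tiles.
buf = list(range(24))
args = (buf, (2, 2), (2, 3), (4, 2))
assert unpack_then_pack_decomposed(*args) == unpack_then_pack_reference(*args)
```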
One question is whether we want to do this before forming dispatches. The reshape ops become fusion barriers in this case, and we can fuse the transpose ops with their consumers and producers. Without decomposition, we can form unpack + pack into a dispatch of its own, and we can tile and distribute the work. IMO, that is a bit worse because we need an extra kernel launch (if there are producers and consumers).
The simplification will result in more dispatch launches because the reshape op becomes a fusion barrier. We should instead fuse any unfoldable unpack + pack pair into a single dispatch and codegen it.