This PR relaxes the condition for loop subsumption of circular DMA ops, so that an npu.circular_dma_cpy_nd op can be hoisted out of the loop even if another npu.dma_cpy_nd op uses the same connection op after it.
With this change, more loops can be subsumed and more npu.dma_cpy_nd ops hoisted out of them. This PR builds on https://github.com/nod-ai/iree-amd-aie/pull/812 and brings the DMA optimizations into Passes.cpp.
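
To illustrate the relaxed condition, here is a minimal toy sketch in Python. It is not the pass implementation: the `Op` class, the `can_hoist_circular_dma` helper, and the `relaxed` flag are hypothetical names invented for this example, and the model assumes the only thing that changed is whether a later npu.dma_cpy_nd user of the same connection blocks hoisting. The real pass likely checks more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for an op in a loop body."""
    name: str        # e.g. "npu.circular_dma_cpy_nd"
    connection: str  # label of the connection op this op uses


def can_hoist_circular_dma(body, index, relaxed):
    """Toy check: may the circular DMA op at body[index] leave the loop?

    Assumption: with relaxed=False, a later npu.dma_cpy_nd user of the
    same connection blocks hoisting; with relaxed=True it does not.
    """
    op = body[index]
    if op.name != "npu.circular_dma_cpy_nd":
        return False
    later_users = [
        other
        for other in body[index + 1:]
        if other.name == "npu.dma_cpy_nd" and other.connection == op.connection
    ]
    return relaxed or not later_users


# Loop body where a npu.dma_cpy_nd op follows on the same connection.
loop_body = [
    Op("npu.circular_dma_cpy_nd", "conn0"),
    Op("npu.dma_cpy_nd", "conn0"),
]
blocked_before = can_hoist_circular_dma(loop_body, 0, relaxed=False)
allowed_now = can_hoist_circular_dma(loop_body, 0, relaxed=True)
```

In this toy model, the later user on `conn0` blocks hoisting under the old condition and no longer does under the relaxed one.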