microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] Move graph compilation behind higher transformers (graph optimization) #20915

Open peishenyan opened 3 months ago

peishenyan commented 3 months ago

Describe the feature request

Hi developers, during my exploration of supporting L2 graph optimization for the WebNN EP, the logic in ORT confused me. In inference_session.cc, lines 1233-1243, I found that the partitioner.Partition() function is called before the level-2 and level-3 optimizations. However, some execution providers, such as NNAPI and WebNN, call ep->Compile() inside partitioner.Partition(), which fuses the supported graph into a single node and loses the opportunity to perform further graph optimization. (Some other EPs, which only register per-op kernels instead of compiling a graph/subgraph, do not need to call Compile.)

So I wonder: logically, we should check the capability of each EP and partition the graph among the EPs only after making all changes to the graph, so why does ORT partition first and run the higher-level transforms later? Can we re-order this?
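To make the ordering concern concrete, here is a toy Python sketch (not ORT code; all names are illustrative) of the pipeline as it currently stands: once a compiling EP fuses its supported nodes during partitioning, a later L2-style fusion can no longer see the pattern it is looking for.

```python
def run_l1(graph):
    # L1: provider-independent transforms (e.g. drop redundant nodes)
    return [op for op in graph if op != "Identity"]

def partition_and_compile(graph, supported):
    # A compiling EP fuses all of its supported nodes into one opaque node
    # during partitioning, before L2/L3 have run.
    fused = [op for op in graph if op in supported]
    rest = [op for op in graph if op not in supported]
    return (["FusedSubgraph"] if fused else []) + rest

def run_l2(graph):
    # L2-style fusion: collapse an Erf-based Gelu pattern into a Gelu node.
    # It cannot fire once the pattern is hidden inside FusedSubgraph.
    if {"Erf", "Mul", "Add"} <= set(graph):
        return [op for op in graph if op not in {"Erf", "Mul", "Add"}] + ["Gelu"]
    return graph

model = ["Identity", "Erf", "Mul", "Add", "MatMul"]
g = run_l1(model)                                    # ['Erf', 'Mul', 'Add', 'MatMul']
g = partition_and_compile(g, {"Erf", "Mul", "Add", "MatMul"})
print(run_l2(g))                                     # ['FusedSubgraph']: fusion never fires
```

Run the same toy pipeline with partitioning last and the L2 fusion fires; that asymmetry is the crux of the request.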

For the DML EP, in order to achieve L2/L3 graph optimization while preserving graph compilation, the developers compromised by disguising the whole-graph compilation as an L3 optimization called DmlGraphFusionTransformer. That approach took a lot of effort, and it leads me to believe that the problem above is a general one rather than WebNN-specific.

Describe scenario use case

This will benefit all EPs that need to compile the supported graph/subgraph into a node and also want high-level graph optimizations.

skottmckay commented 3 months ago

I believe L2 and L3 were intended for internal, ORT-EP-specific optimizations. Because they're EP specific, they cannot run before partitioning. Unfortunately, the separation is polluted a little by allowing some additional EPs with static kernels to use the L2 and L3 optimizers. That has historical reasons; it would be better if we abstracted out the more re-usable optimization logic so any EP could leverage it during partitioning, but that's how it is currently.

In theory you could delay the Compile until later, but that just encourages further pollution of core code with lists of EPs that the L2/L3 optimizations apply to, along with having to maintain a whole lot of state information from partitioning until after those optimizations in order to call Compile. The existing lists are already somewhat out of control.

That said, an EP is free to implement its own optimizations during partitioning.

peishenyan commented 3 months ago

Thanks for your response. I will try to solve these problems.

peishenyan commented 2 months ago

Thanks again for your answers, but I ran into some problems while implementing our own optimizations during partitioning. Since partitioner.Partition() receives a GraphViewer instead of a Graph, it cannot reuse the existing transformers, which take a Graph and mutate the graph structure internally. The GraphViewer makes it much more complex to implement graph optimization in Compile(). Should we copy and rewrite the transformers we want to use for WebNN? That does not seem like a good idea.

This also made me think further. We should note that the point of putting partitioning before the L2/L3 optimizations is to let each EP perform graph optimization within its supported subgraphs, and the point of graph compilation is to let each compiling EP compile a static subgraph into an execution unit. Based on this analysis, I think there is good reason to separate partitioning from compilation and to delay compilation until the highest-level optimizations are complete, which would make the whole process clearer.

skottmckay commented 2 months ago

An optimizer generally has two parts.

First is the logic to determine whether a set of nodes can be modified, which can be done with a GraphViewer. The existing optimizer logic could be refactored to split this part out for re-use.

Second is the logic to make the modifications.
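That split could be sketched like this (illustrative Python, not ORT's actual transformer API): the detection half only reads the graph, so a viewer-like object suffices, while the mutation half needs write access.

```python
def find_gelu_pattern(viewer):
    """Detection: read-only, so a GraphViewer-like object is enough."""
    ops = [n["op"] for n in viewer]
    if {"Div", "Erf", "Add", "Mul"} <= set(ops):
        return [i for i, n in enumerate(viewer)
                if n["op"] in {"Div", "Erf", "Add", "Mul"}]
    return None

def fuse_gelu(graph, match):
    """Mutation: requires a writable graph."""
    keep = [n for i, n in enumerate(graph) if i not in match]
    return keep + [{"op": "Gelu"}]

graph = [{"op": "Div"}, {"op": "Erf"}, {"op": "Add"},
         {"op": "Mul"}, {"op": "MatMul"}]
match = find_gelu_pattern(graph)   # a compiling EP could run just this part
if match is not None:
    graph = fuse_gelu(graph, match)
print([n["op"] for n in graph])    # ['MatMul', 'Gelu']
```

With the detection half factored out, an EP could reuse it against the GraphViewer it receives even though it cannot call the mutation half.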

In IExecutionProvider::Compile you should not be thinking about modifying the original graph at all. ORT is going to delete all of the nodes being compiled (which are provided via the FusedNodeAndGraph.filtered_graph GraphViewer) and replace them with FusedNodeAndGraph.fused_node.

The role of Compile is to set the function pointers in NodeComputeInfo.node_compute_funcs to provide the logic previously handled by filtered_graph. That can be done however you wish. For CoreML, as an example, we build a CoreML model from filtered_graph, as CoreML doesn't know or care about ONNX nodes. All of this is external to the original model.
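A rough analogue of that description (illustrative Python; the real interface is C++ with NodeComputeInfo structs and callbacks): Compile builds whatever backend representation it likes from the read-only subgraph view and hands back compute functions that stand in for the deleted nodes.

```python
class NodeComputeInfo:
    """Toy stand-in for ORT's NodeComputeInfo callback trio."""
    def __init__(self):
        self.create_state = None
        self.compute = None
        self.release_state = None

def compile_fused_node(filtered_graph):
    # Build a backend "model" from the read-only subgraph view
    # (stand-in for e.g. building a CoreML model); the original
    # graph is never touched.
    backend_model = [n["op"] for n in filtered_graph]

    info = NodeComputeInfo()
    info.create_state = lambda: {"model": backend_model}
    info.compute = lambda state, x: f"ran {len(state['model'])} ops on {x}"
    info.release_state = lambda state: None
    return info

info = compile_fused_node([{"op": "MatMul"}, {"op": "Gelu"}])
state = info.create_state()
print(info.compute(state, "input0"))  # ran 2 ops on input0
```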

peishenyan commented 2 months ago

I completely agree with you. Optimizers should not be applied at the compilation stage. We should give the compiling EPs a chance to do further optimization, so separating partitioning from compilation is necessary.

So I wonder if ORT could give compiling EPs a second chance to compile the graph after the L2/L3 optimizations, letting each EP decide for itself when to compile. I have made some attempts in my own workspace: https://github.com/peishenyan/onnxruntime/commit/5220a192b1844b31c021d318361b824f14782798. In inference_session.cc I define a new function called PostPartition(), which has the same structure as partitioner.Partition() but runs after the L2/L3 optimizations. In Partition(), the WebNN EP's GetCapability() returns a list of individual nodes, just like a non-compiling EP, which tells ORT which nodes the WebNN EP supports for computation; these are not compiled yet. In PostPartition(), the WebNN EP's PostGetCapability() returns a list of subgraphs as before, and these are compiled. In this way, the WebNN EP can benefit from the L2/L3 optimizations while keeping the ability to compile.

And not only the WebNN EP: I believe all compiling EPs can benefit from this structure if they can decompose their original logic into partitioning and compiling steps.
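The proposed two-phase flow might be sketched like this (a toy Python model; partition/run_l2/post_partition are stand-ins for the experimental Partition()/PostPartition() split, not ORT APIs). In contrast to the current ordering, the Erf-based Gelu fusion can fire before anything is compiled away.

```python
def partition(graph, supported):
    # Phase 1: claim nodes individually, like a non-compiling EP; no Compile yet
    return [i for i, op in enumerate(graph) if op in supported]

def run_l2(graph):
    # Claimed nodes are still visible, so the Erf-pattern -> Gelu fusion fires
    if {"Erf", "Mul", "Add"} <= set(graph):
        return [op for op in graph if op not in {"Erf", "Mul", "Add"}] + ["Gelu"]
    return graph

def post_partition(graph, supported):
    # Phase 2, after L2/L3: now group supported nodes into a compiled subgraph
    fused = [op for op in graph if op in supported]
    rest = [op for op in graph if op not in supported]
    return (["CompiledSubgraph"] if fused else []) + rest

supported = {"Erf", "Mul", "Add", "MatMul", "Gelu"}
g = ["Erf", "Mul", "Add", "MatMul"]
claimed = partition(g, supported)      # nodes noted as supported, not compiled
g = run_l2(g)                          # -> ['MatMul', 'Gelu']
print(post_partition(g, supported))    # ['CompiledSubgraph']
```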

peishenyan commented 2 months ago

Hi @skottmckay, splitting partitioning and compilation is the most general solution I can think of for this compiling-EP problem. I wonder what your opinion is of my experimental code https://github.com/peishenyan/onnxruntime/commit/5220a192b1844b31c021d318361b824f14782798, or do you have any other suggestions? Thanks.

skottmckay commented 2 months ago

Sorry, I'm not quite following. I wasn't suggesting an EP shouldn't do optimizations in the compilation phase. If anything, that's where I would expect optimizations to happen, as the EP is free to do whatever it wants in Compile.

The Level 2 and 3 optimizers use internal ORT operators, not ONNX operators. We don't expect external EPs to take a dependency on our internal operators in general, as they're not defined in a public specification the way ONNX operators are.

Giving an EP direct access to the graph also means it can break the graph. When that happens, it's not unreasonable for someone to assume ORT is to blame, given that ORT owns the graph, and to create a GitHub issue we then need to deal with.

If I understand correctly:

  • you want to re-use some of the existing ORT L2 or L3 optimizer logic
  • the implementation of those currently modifies the graph directly
  • IExecutionProvider::Compile receives a GraphViewer and cannot modify the graph directly

It's not clear why you REQUIRE the ability to modify the graph directly in the EP's Compile, given that as soon as Compile completes ORT will delete all the nodes that were in the subgraph being compiled.

Is the real problem to solve that you want to somehow re-use existing L2 or L3 optimizer logic and we can separate that from any requirement to directly modify the graph?

Maybe a concrete example would help. What is a specific optimization you want to re-use during Compile?

peishenyan commented 2 months ago

If I understand correctly

  • you want to re-use some of the existing ORT L2 or L3 optimizer logic
  • the implementation of those currently modifies the graph directly
  • IExecutionProvider::Compile receives a GraphViewer and cannot modify the graph directly

This is what I wanted to express. To be honest, my most immediate goal is to enable Gelu and LayerNorm fusion for the WebNN EP. Gelu and LayerNorm are two ONNX operators whose fusions are implemented as L2 optimizations, and we want to reuse operator-fusion optimizations like these two.
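For context, the Erf-based subgraph that GeluFusion collapses computes Gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))). A quick numeric check that the decomposed node chain and the fused form agree:

```python
import math

def gelu_decomposed(x):
    # Div -> Erf -> Add -> Mul -> Mul, as separate "nodes"
    t = x / math.sqrt(2.0)   # Div
    t = math.erf(t)          # Erf
    t = t + 1.0              # Add
    t = t * x                # Mul
    return t * 0.5           # Mul

def gelu_fused(x):
    # The single Gelu op the fusion produces (exact, erf-based formulation)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(gelu_decomposed(x) - gelu_fused(x)) < 1e-12
print("fusion preserves semantics")
```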

BTW, why do GeluFusion and LayerNormFusion live in the L2 optimizations instead of L1? 😂 Gelu and LayerNorm are ONNX operators.

Honry commented 2 months ago

@fdwr, do you have any insightful perspectives on this issue?

skottmckay commented 2 months ago

Gelu and LayerNorm were only relatively recently added as ONNX operators. Now that they are, I would expect the optimizers that add them could move to L1.

peishenyan commented 2 months ago

Sounds good! I think I can help add them to L1 if needed. 😸

skottmckay commented 2 months ago

If you want to put together a PR to move them to L1 that would be great.

One caveat: applying the fusion in L1 will need to be conditional on the ONNX opset the model imports, to make sure the ONNX operator is valid. LayerNorm requires opset 17+ and Gelu requires opset 20+.

Because of that, they may also need to keep running in L2 for older models, since the internal contrib op doesn't have a requirement for any specific ONNX opset.
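The gating could look roughly like this (a toy sketch; `fusion_levels` and the level sets are hypothetical, while the opset thresholds, LayerNormalization since opset 17 and Gelu since opset 20, come from the ONNX operator spec):

```python
# ONNX opset in which each operator was introduced
ONNX_OPSET_FOR_OP = {"LayerNormalization": 17, "Gelu": 20}

def fusion_levels(op, model_opset):
    """Return the optimization levels at which fusing to `op` is valid
    for a model importing `model_opset` (toy helper, not an ORT API)."""
    if model_opset >= ONNX_OPSET_FOR_OP[op]:
        # The ONNX op is valid for this model, so the fusion can run at L1
        # (and L2 remains harmless).
        return {1, 2}
    # Older model: only the internal contrib op is usable, so L2 only.
    return {2}

print(fusion_levels("Gelu", 21))                # {1, 2}
print(fusion_levels("Gelu", 17))                # {2}
print(fusion_levels("LayerNormalization", 17))  # {1, 2}
```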

peishenyan commented 1 month ago

@skottmckay Thanks. I'm working on adding the Gelu and LayerNorm fusions adaptively to the L1/L2 optimizations. One problem is that we may have two ways:

This is so confusing.😢