lutzroeder / netron

Visualizer for neural network, deep learning and machine learning models
https://netron.app
MIT License

Metrics metadata support #1240

Open lutzroeder opened 6 months ago

lutzroeder commented 6 months ago

Scenarios:

  1. Allowing the app to contain implementations to compute metrics. See #204.
  2. Opening existing files defined by framework and format vendors. Which formats and tools exist?
  3. Supporting scripted solutions and formats from custom tools. See #1234.

Questions:

kylesayrs commented 3 months ago

I'd be interested in helping with this feature. I implemented a similar feature for Neural Magic's SparseZoo: calculate_ops.py

I propose supporting both weight sparsity metrics and operation metrics.

Counting operations depends on whether the runtime engine and hardware support sparsity, block sparsity, and quantization. The UI design should be capable of supporting these subtypes, if not now then in the future.

  1. Since onnx/onnx#5938 and metadata_schema.fbs seem to be unstructured, supporting these kinds of visualizations seems to be a separate issue.
  2. I propose showing weight sparsity within the node metadata UI alongside the weight, i.e.
    name: model.3.conv.weight
    category: Initializer
    tensor: float32[64,32,3,3] (81% sparsity)

As for visualizing operations, I'm in favor of separating the UI from the node metadata tab so as to make it clear that these performance (operation) metrics are computed values, separate from the data embedded in the model file. For example, they could be a togglable UI element displayed to the left of a node.

kylesayrs commented 3 months ago

Another UI idea might be togglable horizontal bars that appear to the left of a node and are sized according to how many operations of each type are associated with it.

There should also be a UI element for the total number of ops/sparsity in the model, perhaps in the bottom right.

lutzroeder commented 2 months ago

@kylesayrs all great questions.

There seem to be three types of data, and a question of how these are unified and at which API layer they are exposed.

  1. Metrics that are included in the file format or computed with external tools and provided via supplemental files. How do these get surfaced in the API? If metrics are included in metadata, do they surface as metadata or get filtered into metrics during loading?
  2. Metrics that require format-specific computation. Do such metrics need a format-specific implementation, and should it live in the actual model API or separately?
  3. Metrics that can be generally computed for all formats. Is this another layer that takes over if neither of the other two exists?

Since there are likely metrics at the model, graph, node, and weight level, initially exposing them as another section in the properties pages might be a good way to get started. Which data types exist for metrics in the API? For example, if sparsity is a float percentage, could such a single metric later be selected in the properties and used to augment and color the graph? See #1241.
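
For example, a single float metric like sparsity could drive node coloring along these lines (a minimal sketch; the green-to-red scale and the metricColor name are arbitrary choices, not an existing Netron API):

function metricColor(value) {
    // Map a metric in [0, 1] to a hue: 120 = green (low), 0 = red (high).
    const hue = 120 * (1 - value);
    return `hsl(${hue}, 80%, 60%)`;
}

console.log(metricColor(0.75)); // hsl(30, 80%, 60%)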

kylesayrs commented 2 months ago

@lutzroeder Hm, we can implement two classes, NodeMetrics and GraphMetrics.

NodeMetrics

GraphMetrics

To respond to the questions you posed:

  1. I'm not familiar with metric formats provided in supplemental files, but this could be supported with a ModelMetrics class instance on the Model class.
  2. Similarly to (1), these could be implemented in a separate class on the Model API.
  3. This would be implemented by the two APIs proposed above.

Let me know what you think

†Note that in order to calculate per-node metrics for ONNX, we'll need to hard-code which arguments are weights and which are biases for each op type.
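
A purely hypothetical sketch of how the two classes proposed above could fit together; every field, method, and number here is an assumption, not existing Netron code:

class NodeMetrics {
    constructor(weightSizes) {
        this._weightSizes = weightSizes; // element counts of the node's weight tensors
    }
    // Total parameter count for the node.
    get parameters() {
        return this._weightSizes.reduce((sum, size) => sum + size, 0);
    }
}

class GraphMetrics {
    constructor(nodeMetrics) {
        this._nodeMetrics = nodeMetrics;
    }
    // Aggregate parameter count over all nodes in the graph.
    get parameters() {
        return this._nodeMetrics.reduce((sum, node) => sum + node.parameters, 0);
    }
}

const graph = new GraphMetrics([
    new NodeMetrics([64 * 32 * 3 * 3, 64]), // e.g. a conv weight plus bias
    new NodeMetrics([128 * 64])
]);
console.log(graph.parameters); // 26688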

kylesayrs commented 2 months ago

Expanded metrics view

[screenshot of the expanded metrics view]

Compact metrics view

[screenshot of the compact metrics view]

I prefer the compact view, at least for the frontend. The backend can maintain separate members such as sparsity: float, etc. to better support metric-based visualization, but I think the compact view looks nicer for users.

lutzroeder commented 2 months ago

For weight tensors there should be a Tensor Properties view similar to #1122. This will be needed for visualizing tensor data (#65); it avoids duplicating tensor information, gives each tensor a less crowded space, and solves the issue of mixing node metrics and tensor metrics. The individual metrics would be rendered similar to attributes or metadata, which hopefully results in a single mechanism across attributes, metadata, and metrics to annotate the graph.

For implementation, foo.Node::metrics and foo.Tensor::metrics would be similar to foo.Node::attributes and foo.Tensor::metadata, returning a list of foo.Argument. This keeps the mechanism extensible for new format-specific metrics. The initial implementation could be all wrapped in a single get metrics(). This would only be used for format-specific overrides, which are hopefully rare, as the code will add maintenance complexity and increase the file size for a feature that is likely used much less frequently.
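
A rough sketch of what such a get metrics() accessor could return, assuming foo.Argument is a simple (name, value, type) triple; the metric names and values are illustrative:

const foo = {};

foo.Argument = class {
    constructor(name, value, type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
};

foo.Tensor = class {
    // Metrics are exposed as an argument list, mirroring how
    // attributes and metadata are surfaced.
    get metrics() {
        return [
            new foo.Argument('sparsity', 0.81, 'float32'),
            new foo.Argument('parameters', 64 * 32 * 3 * 3, 'int64')
        ];
    }
};

console.log(new foo.Tensor().metrics.map((metric) => `${metric.name}: ${metric.value}`));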

kylesayrs commented 2 months ago

@lutzroeder In order to analyze sparsity, foo.Tensor must decode the tensor data. As far as I can tell, this is only implemented within view.js; some version of _decodeData's implementation should be moved to a shared helper file.

lutzroeder commented 2 months ago

Tensor decoding is generalized in view.Tensor. A lot of effort went into having a single view onto the many different tensor formats. Ideally, metrics should operate at that level and automatically work for all or most formats. The format-specific API is more about keeping options open. It would be interesting to discover the edge cases where format-specific metrics need tensor access; initially this is not supported.

  1. A generalized metric implementation in view.Tensor::metrics drives view.TensorSidebar and calls into xxx.Tensor::metrics to honor custom format-specific metrics or implementations if available. This should cover most cases. Node metrics might be more interesting, as various optimizations like inlining const nodes are hidden in the general API.
  2. xxx.Tensor::metrics can provide a format-specific implementation to override the general case when needed.
  3. Model formats might store metrics as metadata or provide metrics via external files. The model loading code could detect these scenarios and expose them via xxx.Tensor::metrics. This might include additional metrics that are not known to the app, similar to metadata, which is often unstructured but might include known types of metadata.
  4. Other tools might provide a generalized metrics format that integrates at the view.Tensor::metrics or view.Node::metrics layer. Until there is more information about what these look like, this is not a main concern, but the implementation should make it possible to fork or opt in later.

If the general metrics implementations get complex and impact load times, it might be worth considering dynamically loading a module from view.Tensor::metrics and view.Node::metrics; too early to tell.

For tensors, the challenge is that multiple changes are needed to enable #1285. Some formats have separate concepts for tensor initializer and tensor, and differ in how to opt into quantization; what level of abstraction should this view operate on? view.Tensor is generated on demand, while other objects like view.Node live in the view object model to enable selection and activation. The actual tensor data access can be expensive and needs to be refactored to not happen in the constructor if those objects exist in the view object model. How should the potentially large cached tensor data be disposed of when other objects are selected?

kylesayrs commented 2 months ago

Copying my thoughts on default metrics here

https://github.com/lutzroeder/netron/pull/1293#discussion_r1641293800

kylesayrs commented 2 months ago

@lutzroeder I personally prefer not to show default (format-agnostic) metrics, since these metrics are guaranteed to be unreliable without the full context of the format.

I propose an implementation where each format's xxx.Tensor implements xxx.Tensor::calculateMetrics(values). This way the decoding of values stays in view.js, and the frontend is free to lazily compute and cache values as needed.

view.Tensor = class {

    constructor(tensor) {
        this._tensor = tensor;  // type: xxx.Tensor
        this._metrics = null;
    }

    // ...

    get metrics() {
        // Lazily compute and cache the metrics on first access.
        if (this._metrics === null) {
            const value = this.value;
            this._metrics = this._tensor.calculateMetrics(value);
        }
        return this._metrics;
    }
};
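
A companion sketch of the format side, using the thread's xxx placeholder; values is assumed to arrive as a flat array of decoded elements produced by view.Tensor, and the metric itself is just an example:

const xxx = {};

xxx.Tensor = class {
    // Each format decides which metrics apply and how to compute them
    // from the decoded values handed over by view.Tensor.
    calculateMetrics(values) {
        let zeros = 0;
        for (const value of values) {
            if (value === 0) {
                zeros++;
            }
        }
        return [{ name: 'sparsity', value: zeros / values.length }];
    }
};

console.log(new xxx.Tensor().calculateMetrics([0, 1.5, 0, 2])); // sparsity: 0.5
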
lutzroeder commented 2 months ago

No, as that would require lower-level APIs to take dependencies on a higher-level API surface.

It would also lead to an explosion of metric implementations, as each metric would have to be implemented for all formats. Those should probably be computed by the runtime itself and stored as metadata in the model file or a supplemental file.

What are three specific examples where a general metric is unreliable? Would it be possible to generalize the lower-level tensor format to support these cases? The answer might be different for tensor formats (which tend to generalize well) and node formats (which often require processing to make the graph readable).

kylesayrs commented 2 months ago

I'll focus on just tensor sparsity, arguably the most basic of the metrics. This metric only makes sense in the context of how a theoretical inference engine would perform inference, and how an inference engine performs inference depends on what operation is being performed.

  1. Sparsity does not apply to tensors belonging to operations such as ONNX::Gather, PyTorch::BatchNorm2d, and TFLite::Reshape. This is because the operation they belong to cannot be no-oped by any sparsity-aware engine.
  2. Sparsity depends on whether the parent node is a quantized operation. For example, the sparsity of the weight tensor for ONNX::QLinearMatMul depends on the zero point. The zero point is another tensor which exists on the parent node. Also, ONNX supports channel-wise quantization, in which case there is a separate zero point for each channel of the tensor.
  3. Sparsity depends on whether the tensor is a weight or a bias (or neither, as stated in example 1). In all the sparsity-aware engines I know of, weights are converted to no-ops but biases are not.

It's my opinion that computing sparsity for tensors in these scenarios is misleading. For example, let's say we implement a metric search feature. If someone wants to use this feature to find tensors which they should prune, they would query for tensors with sparsity < 80% and potentially get back a random Gather tensor. Not super helpful. In my opinion, it's confusing to see sparsity displayed for a tensor belonging to an operation like Unsqueeze, which has nothing to do with sparsity. Computing any kind of metric (https://github.com/lutzroeder/netron/issues/65) for these kinds of tensors seems not very useful?
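
To make the quantized case in (2) concrete, here is a hedged sketch of the difference; the function names are illustrative, a single scalar zero point is assumed, and ONNX channel-wise zero points would need an extra axis argument:

function naiveSparsity(values) {
    // Treats raw zeros as sparse, which is wrong for quantized tensors.
    return values.filter((value) => value === 0).length / values.length;
}

function quantizedSparsity(values, zeroPoint) {
    // In a quantized tensor, an element is "zero" when it equals the
    // zero point, not when its raw integer value is 0.
    return values.filter((value) => value === zeroPoint).length / values.length;
}

const weights = [128, 128, 130, 128, 121, 128, 128, 140]; // uint8 weight values
console.log(naiveSparsity(weights));          // 0 - misleading
console.log(quantizedSparsity(weights, 128)); // 0.625 - the actual sparsity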

kylesayrs commented 2 months ago

Would it be possible to generalize the lower-level tensor format to support these cases? The answer might be different for tensor formats

I think I need more context or an example of what you're thinking of. Are you suggesting grouping tensors? This would still require some format-specific implementation, although it might help.

It would also lead to an explosion of metric implementations as each metric would have to be implemented for all formats

To be fair, the end goal is to support FLOPS, which is by far the most requested metric (#204). FLOPS are 100% operation-specific, so trying to support FLOP counts for all operations across all formats is way too large a scope in the first place.

kylesayrs commented 2 months ago

I think sparsity and operation count metrics are a super valuable feature and could really help a lot of people get a sense of how large their models are, how well they will perform, and where the performance bottlenecks are.

No implementation will be perfect, since metrics related to model performance are entirely dependent on the hardware and the level of sparsity-awareness of the inference engine. I think if we want to support FLOPS, we need to implement it on a per-format basis, which means a slow format-by-format rollout.

lutzroeder commented 2 months ago

Sparsity does not apply to tensors belonging to operations

One idea would be that the format-specific implementation can disable a metric by returning a value like undefined or NaN. Higher-level code then filters and skips the metric. Not sure if that applies to sparsity though. Would sparsity exist for nodes as well? For tensors, deciding whether it applies might be up to the user? Reshape could be handled by generally not computing the metric for very small integer vectors?
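
A sketch of that disable-and-filter convention (all names here are illustrative, not actual Netron APIs):

function sparsityFor(opType, values) {
    // Format-specific knowledge: these op types cannot be no-oped by a
    // sparsity-aware engine, so the metric does not apply.
    const notApplicable = new Set(['Gather', 'Reshape', 'Unsqueeze']);
    if (notApplicable.has(opType)) {
        return NaN;
    }
    return values.filter((value) => value === 0).length / values.length;
}

function visibleMetrics(metrics) {
    // Higher-level code skips disabled metrics before rendering.
    return metrics.filter((metric) => !Number.isNaN(metric.value));
}

const metrics = [
    { name: 'sparsity', value: sparsityFor('Gather', [0, 1, 2, 0]) },
    { name: 'sparsity', value: sparsityFor('Conv', [0, 1, 2, 0]) }
];
console.log(visibleMetrics(metrics)); // [ { name: 'sparsity', value: 0.5 } ]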

the end goal is to support FLOPS

Trying to better understand this. FLOPS would be computed at the node level, not at the tensor level? Is access to the tensor data needed to compute FLOPS?

kylesayrs commented 2 months ago

One idea would be that format-specific implementation can disable by returning a value like undefined or NaN

I think this is a good idea too. Setting these NaNs wouldn't require decoding the values either, only knowing which node the tensor belongs to. This would have to be a slow-rollout feature, since there are too many cases to consider.

Would sparsity exist for nodes as well? FLOPS would be computed at the node level, not at the tensor level

I think this is a good point. It seems like parameter sparsity only applies to tensors, while operations (and operation sparsity) only apply to nodes. Separating the two seems to resolve a lot of the ambiguity about how each should be applied.

For tensors, deciding if it applies might be up to the user?

I think this could be fine if we give the user the ability to query for operation sparsity, and maybe a disclaimer about trying to interpret parameter (tensor) sparsity. In general, when coloring and querying the graph, we should point people towards operations and operation sparsity. In fact, coloring a node by parameter sparsity doesn't make much sense, since each node can have multiple tensors (parameters).

EDIT:

Is access to the tensor data needed to compute FLOPS?

I need to think more about this one. For all the cases I can think of, I think we can get away with only needing the tensor sparsity (or block sparsity). Note that in order to compute FLOPS, we need to know the input sizes.

kylesayrs commented 2 months ago

We can start by implementing context-unaware tensor sparsity on the frontend, with a help tooltip indicating that sparsity may not always apply. We are then free to slowly roll out format-specific NaNs and zero-point quantization cases.

In terms of implementing node FLOPs, this is a little bit harder. FLOPs calculations clearly need to be operation-type specific, and I think most operation counts can be calculated using the parameter sparsity metric alone. I can research this a little more and get back to you.
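
As a starting point, a sketch of one op-type-specific estimate; the dense 2 * M * K * N count for a MatMul is standard, while discounting it uniformly by weight sparsity is exactly the simplifying assumption discussed above (a sparsity-aware engine that skips zero weights):

function matMulFlops(m, k, n, weightSparsity) {
    // One multiply and one accumulate per (M x K) x (K x N) output element.
    const dense = 2 * m * k * n;
    return dense * (1 - weightSparsity);
}

// A [64, 512] input through a [512, 1000] weight with 75% sparsity:
console.log(matMulFlops(64, 512, 1000, 0.75)); // 16384000

Each op type (Conv, Gemm, attention, etc.) would need its own formula, which is why the rollout has to be incremental.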

kylesayrs commented 2 months ago

Is access to the tensor data needed to compute FLOPS?

After asking around, it seems like some sparsity-aware runtimes such as the DeepSparse engine do skip padding operations, meaning that the actual positioning of the zeros affects the total number of operations. This may or may not be a factor we include.