microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Same Model Hash Code Issue from different models #21672

Open geekadalovelace opened 1 month ago

geekadalovelace commented 1 month ago

Describe the issue

If two models have the same architecture and identical input/output tensor names for each node, they generate the same model hash code. Even when the graph structure is identical, different weights and shapes should yield different models and therefore different hash codes.

The hash code is only dependent on the names. https://github.com/microsoft/onnxruntime/blob/d616025884da05368c38270338b1ab3698e0ecb6/onnxruntime/core/framework/model_metadef_id_generator.cc#L52
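The collision can be illustrated with a small sketch (a hypothetical simplification in Python, not ORT's actual MurmurHash3-based C++ code): if the hash folds in only operator and tensor names, two models that differ solely in their weight values map to the same digest.

```python
import hashlib

def name_only_hash(nodes):
    """Mimic a metadef-style hash that incorporates only op types and
    input/output tensor names (illustrative, not the real ORT code)."""
    h = hashlib.sha256()
    for op_type, inputs, outputs in nodes:
        h.update(op_type.encode())
        for name in (*inputs, *outputs):
            h.update(name.encode())
    return h.hexdigest()

# Two "models" with identical graphs and tensor names; on disk their
# weight tensors differ, but the hash never sees the weight bytes.
model_a = [("Gemm", ("X", "W"), ("Y",))]
model_b = [("Gemm", ("X", "W"), ("Y",))]  # same names, different weights

print(name_only_hash(model_a) == name_only_hash(model_b))  # collision
```

Any two models with matching names collide here, which is exactly the behavior reported above.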

To reproduce

Two models with the same architecture but different weights generate the same model hash code.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

v1.10

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

skottmckay commented 4 weeks ago

Is this a real world issue or theoretical? This would only occur if the same instance of the EP was loaded in multiple sessions, and it was a compiling EP. Hashing the weights would add a huge cost.

geekadalovelace commented 4 weeks ago

> Is this a real world issue or theoretical? This would only occur if the same instance of the EP was loaded in multiple sessions, and it was a compiling EP. Hashing the weights would add a huge cost.

This is a real-world issue. I have models with the same architecture but trained with different channel sizes. Also, there may be models with the same architecture but weights trained for different training objectives. Compiling models is time-consuming, so I cache the compilation results and use the hash code as the key for the cache.

I modified the code to hash the weights and observed that the time to generate the hash code grows proportionally with model size. I'm looking for a smarter solution to this problem.
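One workaround that avoids hashing every weight byte is to fingerprint each initializer by its name, shape, and a small sample of its raw bytes. This is only a sketch of the idea (the function and its sampling strategy are hypothetical, not part of ORT), and a sampled hash can still miss differences that fall outside the sampled regions:

```python
import hashlib

def sampled_weight_fingerprint(initializers, sample=64):
    """Cheap weight-sensitive fingerprint (illustrative workaround):
    hash each tensor's name, shape, and the first/last `sample` bytes
    of its raw data instead of the full buffer."""
    h = hashlib.sha256()
    for name, shape, raw in initializers:
        h.update(name.encode())
        h.update(repr(shape).encode())
        h.update(raw[:sample])   # head sample
        h.update(raw[-sample:])  # tail sample
    return h.hexdigest()

# Same names and shapes, one differing weight byte -> different fingerprint.
a = [("W", (2, 2), bytes([1, 2, 3, 4]))]
b = [("W", (2, 2), bytes([9, 2, 3, 4]))]
print(sampled_weight_fingerprint(a) != sampled_weight_fingerprint(b))
```

The cost is proportional to the number of tensors rather than the total weight size, at the price of a (usually small) chance of missing a change in the unsampled middle of a large tensor.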

skottmckay commented 2 weeks ago

The intended usage of ModelMetadefIdGenerator was to create a deterministic yet unique hash that can be used in the name of the node containing the compiled model, to make it easier to debug issues. It wasn't intended as a cache-key hash.

Where is the caching code? I assume ORT isn't handling that so it's not clear why the ModelMetadefIdGenerator hash needs to be used as the cache key.
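Since the caching lives outside ORT, the cache key doesn't have to come from ModelMetadefIdGenerator at all. One option (a sketch of the general idea, not anything ORT provides) is to stream the entire .onnx file through a cryptographic hash, so any change to weights or shapes changes the key:

```python
import hashlib

def file_cache_key(path, chunk=1 << 20):
    """Hypothetical cache key for compiled artifacts: SHA-256 over the
    whole model file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```

This reads the full file once, but sequential hashing is typically much cheaper than recompilation, and the key is stable across sessions and machines.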