RhysU opened this issue 2 years ago
I should mention...
The non-determinism isn't an issue because people often call SerializeToString() twice in succession within one process. It's a problem because it makes it hard to compare the output of two training pipelines to confirm that nothing changed.
A couple of functions use a static counter to give unique names to variables. The structure is the same but some names are different. There is no easy way to compare two ONNX graphs.
Would you please point me to the location of the static counters in the code? Why are they static?
If the counters are controlled, an equality check for two ONNX graphs becomes simple because one can compare bytes.
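For concreteness, the check I have in mind is nothing fancier than byte equality over the serialized protos; a minimal sketch (graphs_identical is my hypothetical helper, not an skl2onnx API):

from onnx import ModelProto

def graphs_identical(model_a: ModelProto, model_b: ModelProto) -> bool:
    # Byte-for-byte equality is only meaningful once serialization
    # is deterministic; today's naming variability defeats it.
    return model_a.SerializeToString() == model_b.SerializeToString()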
My original recreate, serializing twice in succession, would produce a false positive in the event of "static", module-level data. Meaning, some variable in some module that is initialized in a deterministic fashion and then mutated by multiple calls to convert_sklearn(...). However, the recreate that I now include will not false positive in that circumstance.
Below, I am computing the MD5 of the ExtraTreesRegressor and the MD5 of the serialized ONNX bytes. I run the recreate multiple times, each in its own CPython process. I unexpectedly see the ONNX bytes MD5 change between CPython invocations. A "static counter" does not explain the following behavior unless the static counter is non-deterministically initialized:
$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: 6b599277950e694a6a76372ddc981cda
$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: 3331a36a5c49502d5a6b01212344e35d
$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: c0972feb50a43eba5c3a7a326b5567a1
The code:
from hashlib import md5
from numpy import frombuffer, uint8
from numpy.random import RandomState
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import convert_sklearn
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import export_text

if __name__ == "__main__":
    # Fit on synthetic, deterministic data
    etr = ExtraTreesRegressor(random_state=0, n_estimators=5)
    state = RandomState(4)
    X = state.normal(size=(100, 4))
    y = state.normal(size=(100,))
    etr.fit(X, y)

    # Compute an MD5 hash over the trees' textual representations
    h = md5()
    for estimator in etr.estimators_:
        t = export_text(estimator)  # Appears to not be cached but check me!
        w = t.encode(encoding='UTF-8', errors='strict')
        h.update(w)
    etr_md5 = h.hexdigest()
    print(f"ExtraTreesRegressor MD5: {etr_md5}")

    # Convert the object to ONNX then serialize to bytes
    initial_types = [("float_input", FloatTensorType([None, 4]))]
    o = convert_sklearn(etr, initial_types=initial_types)
    s = o.SerializeToString()
    b = frombuffer(s, dtype=uint8)

    # Compute an MD5 hash over the bytes
    h = md5()
    h.update(b.tobytes())
    b_md5 = h.hexdigest()
    print(f"bytes MD5: {b_md5}")
Why would a static counter be non-deterministically initialized?
Reproduced on https://github.com/onnx/sklearn-onnx/commit/c730c44fbba7112d4d61a4da76dcb3a5d14c6eee.
The converter needs to give a unique name to every intermediate result. If a pipeline contains two scalers, the second one must have a different result name. The static counter is used to make it different. Every new name goes through generate_unique_name. There might be some use of the function id(.) too. We never checked that two consecutive runs give exactly the same results, but it would probably be a good idea to check that this is true.
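For readers following along, here is a minimal sketch of counter-based naming in the spirit of generate_unique_name (my own illustration, not skl2onnx's actual code). Note that nothing in it varies between fresh processes:

# Illustration only: counter-based unique naming, not skl2onnx's code.
_counter = 0  # module-level ("static") state, starts at 0 in every fresh process

def unique_name(seed_name, existing_names):
    # Append an incrementing suffix until the candidate name is unused.
    global _counter
    name = seed_name
    while name in existing_names:
        _counter += 1
        name = f"{seed_name}_{_counter}"
    existing_names.add(name)
    return name

names = set()
print(unique_name("scaler_output", names))  # scaler_output
print(unique_name("scaler_output", names))  # scaler_output_1

Run in a fresh interpreter, this always produces the same sequence, so a counter like this is consistent with within-process name drift but cannot by itself explain the process-to-process MD5 changes above.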
Thanks @xadupre. The uses of _generate_unique_name do not explain process-to-process variability because the existing_names set would not be carried across multiple processes (I am referring to the _generate_unique_name call sites). Grepping for id(...) suggests that it is more likely the root cause:
$ ag -Ql 'id(' skl2onnx
skl2onnx/_parse.py
skl2onnx/algebra/complex_functions.py
skl2onnx/algebra/graph_state.py
skl2onnx/algebra/onnx_operator.py
skl2onnx/common/_container.py
skl2onnx/common/_topology.py
skl2onnx/common/utils_sklearn.py
skl2onnx/convert.py
skl2onnx/operator_converters/bagging.py
skl2onnx/operator_converters/calibrated_classifier_cv.py
skl2onnx/operator_converters/grid_search_cv.py
skl2onnx/operator_converters/linear_classifier.py
skl2onnx/operator_converters/linear_regressor.py
skl2onnx/operator_converters/one_vs_rest_classifier.py
skl2onnx/operator_converters/pipelines.py
skl2onnx/operator_converters/stacking.py
skl2onnx/sklapi/woe_transformer_onnx.py
It's probably worth auditing and removing id(...) usage; everywhere outside of the logger.debug(...) statements feels suspect. Per https://docs.python.org/3/library/functions.html#id, "Two objects with non-overlapping lifetimes may have the same id() value," so it's not as unique as one might hope even within a single CPython process.
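A tiny demonstration of the failure mode (a sketch with hypothetical names): in CPython, id() is the object's memory address, which allocator behavior and address-space layout randomization vary from run to run, so any generated name that embeds it varies too:

# Run this twice in separate processes: the printed name changes because
# id() is a memory address in CPython and addresses differ between runs.
class FakeOperator:
    pass

op = FakeOperator()
print(f"variable_{id(op)}")  # e.g. variable_140362723621136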
In our case, we assume the pipeline remains alive until the conversion ends. The use of id should not produce any conflicts. But I agree, id(.) should be removed.
I'm not sure when I would be able to get to it, but would you entertain merge requests that gradually remove id(...)?
Sure.
Calling convert_sklearn(...) twice on the same tree-based model can result in two different SerializeToString() outputs. I expect the method to be deterministic (aka pure), meaning that calling it multiple times produces the same output each time. To reproduce, against the default branch apply this diff to https://github.com/onnx/sklearn-onnx/blob/main/tests/test_utils/tests_helper.py#L474:
Then run:
You'll see the assertion trip in TestSklearnDecisionTreeModels.test_extra_tree_classifier. Having stared at this, the differences do not seem to be deterministic, in that one can call convert_sklearn(...) multiple times and each time get a different serialized representation. Generally, sprinkling assertions of that nature before every dump_data_and_model(...) invocation in https://github.com/onnx/sklearn-onnx/blob/main/tests/test_utils/tests_helper.py causes quite a few assertion trips across the entire skl2onnx test suite.
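In the same spirit, a standalone sketch of the determinism assertion (not the exact diff referenced above) that converts the same fitted model twice and compares the serialized bytes:

# Sketch: convert the same fitted model twice and demand identical bytes.
from numpy.random import RandomState
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import ExtraTreesRegressor

state = RandomState(4)
X = state.normal(size=(100, 4))
y = state.normal(size=(100,))
model = ExtraTreesRegressor(random_state=0, n_estimators=5).fit(X, y)

initial_types = [("float_input", FloatTensorType([None, 4]))]
first = convert_sklearn(model, initial_types=initial_types).SerializeToString()
second = convert_sklearn(model, initial_types=initial_types).SerializeToString()
assert first == second, "convert_sklearn(...) is not deterministic"

This is the within-process variant; the MD5 recreate earlier in the thread covers the cross-process case.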