onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

convert_sklearn(ExtraTreeClassifier) is not deterministic #881

Open RhysU opened 2 years ago

RhysU commented 2 years ago

Calling convert_sklearn(...) twice on the same tree-based model can result in two different SerializeToString() outputs. I expect the method to be deterministic (i.e., pure): calling it multiple times should produce the same output each time.

To reproduce, apply this diff against the default branch to https://github.com/onnx/sklearn-onnx/blob/main/tests/test_utils/tests_helper.py#L474:

diff --git a/tests/test_utils/tests_helper.py b/tests/test_utils/tests_helper.py
index 8fdf607..56d6e07 100644
--- a/tests/test_utils/tests_helper.py
+++ b/tests/test_utils/tests_helper.py
@@ -494,6 +494,13 @@ def dump_one_class_classification(
     model_onnx, prefix = convert_model(
         model, "one_class", [("input", FloatTensorType([None, 2]))],
         target_opset=target_opset)
+    # BEGIN Deterministic?
+    model_onnx2, prefix2 = convert_model(
+        model, "one_class", [("input", FloatTensorType([None, 2]))],
+        target_opset=target_opset)
+    assert model_onnx.SerializeToString() == model_onnx2.SerializeToString(), "Deterministic?"
+    del model_onnx2, prefix2
+    # END Deterministic?
     dump_data_and_model(
         X,
         model,

Then run:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python -m pytest tests/test_sklearn_decision_tree_converters.py

You'll see the assertion trip in TestSklearnDecisionTreeModels.test_extra_tree_classifier. Having stared at this, the differences themselves do not appear deterministic: one can call convert_sklearn(...) multiple times and get a different serialized representation each time.

Generally, sprinkling assertions of that nature before every dump_data_and_model(...) invocation in https://github.com/onnx/sklearn-onnx/blob/main/tests/test_utils/tests_helper.py causes quite a few assertion trips across the entire skl2onnx test suite.
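For reference, that sprinkling can be factored into a small helper and called in place of each bare convert_model(...) before dump_data_and_model(...). This is only a sketch; convert_model is the existing helper in tests_helper.py, invoked exactly as in the diff above, and the helper name is made up here.

def assert_convert_deterministic(model, prefix_name, inputs, target_opset=None):
    # Convert twice and require byte-identical serializations before dumping.
    first, prefix = convert_model(model, prefix_name, inputs,
                                  target_opset=target_opset)
    second, _ = convert_model(model, prefix_name, inputs,
                              target_opset=target_opset)
    assert first.SerializeToString() == second.SerializeToString(), "Deterministic?"
    return first, prefix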

RhysU commented 2 years ago

I should mention...

The non-determinism isn't an issue because people often call SerializeToString() twice in succession within one process. It's a problem because it makes it hard to compare the output of two training pipelines and confirm that nothing changed.
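A minimal sketch of the check that non-determinism breaks, assuming the serialization were stable: hash the converted model's bytes and compare against a digest recorded by an earlier pipeline run. The baseline bookkeeping below is purely illustrative.

from hashlib import sha256

def onnx_digest(model_onnx):
    # Hash the serialized graph so two training runs can be compared cheaply.
    return sha256(model_onnx.SerializeToString()).hexdigest()

# current = onnx_digest(convert_sklearn(model, initial_types=initial_types))
# assert current == baseline_digest, "converted model changed between runs"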

xadupre commented 2 years ago

A couple of functions use a static counter to give unique names to variables. The structure is the same but some names are different. There is no easy way to compare two ONNX graphs.

RhysU commented 2 years ago

Would you please point me to the location of the static counters in the code? Why are they static?

If the counters are controlled, an equality check for two ONNX graphs becomes simple because one can compare bytes.

RhysU commented 2 years ago

My original recreate, serializing twice in succession, would produce a false positive in the presence of "static", module-level data, i.e. some variable in some module that is initialized deterministically and then mutated by multiple calls to convert_sklearn(...). However, the recreate that I now include will not produce a false positive in that circumstance.

Below, I am computing the MD5 of the ExtraTreesRegressor and the MD5 of the serialized ONNX bytes. I run the recreate multiple times, each in its own CPython process. I unexpectedly see the ONNX bytes MD5 change between CPython invocations. A "static counter" does not explain the following behavior unless the static counter is non-deterministically initialized:

$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: 6b599277950e694a6a76372ddc981cda

$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: 3331a36a5c49502d5a6b01212344e35d

$ PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python ~/recreates/skl2onnx-881.py
ExtraTreesRegressor MD5: 9c90ae76a13e3d55c0b038f9fdbd0921
bytes MD5: c0972feb50a43eba5c3a7a326b5567a1

The code:

from hashlib import md5
from numpy import frombuffer, uint8
from numpy.random import RandomState
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import convert_sklearn
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import export_text

if __name__ == "__main__":
    # Fit on synthetic, deterministic data
    etr = ExtraTreesRegressor(random_state=0, n_estimators=5)
    state = RandomState(4)
    X = state.normal(size=(100, 4))
    y = state.normal(size=(100, ))
    etr.fit(X, y)

    # Compute an MD5 hash over each tree's textual representation
    h = md5()
    for estimator in etr.estimators_:
        t = export_text(estimator)  # Appears to not be cached but check me!
        w = t.encode(encoding='UTF-8', errors='strict')
        h.update(w)
    etr_md5 = h.hexdigest()
    print(f"ExtraTreesRegressor MD5: {etr_md5}")

    # Convert the object to ONNX then serialize to bytes
    initial_types = [("float_input", FloatTensorType([None, 4]))]
    o = convert_sklearn(etr, initial_types=initial_types)
    s = o.SerializeToString()
    b = frombuffer(s, dtype=uint8)

    # Compute an MD5 hash over the bytes
    h = md5()
    h.update(b.tobytes())
    b_md5 = h.hexdigest()
    print(f"bytes MD5: {b_md5}")

Why would a static counter be non-deterministically initialized?

Reproduced on https://github.com/onnx/sklearn-onnx/commit/c730c44fbba7112d4d61a4da76dcb3a5d14c6eee.

xadupre commented 2 years ago

The converter needs to give a unique name to every intermediate result. If a pipeline contains two scalers, the second one must have a different result name. The static counter is used to make them different. Every new name goes through generate_unique_name. There might be some use of the function id(.) too. We never checked that two consecutive runs give exactly the same results, but it would probably be a good idea to check that they do.
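A toy sketch of the two mechanisms mentioned above (this is not skl2onnx's actual code): a module-level counter keeps counting across conversions within one process, and any name derived from id(...) embeds a memory address that changes from process to process.

_counter = 0

def generate_unique_name(seed):
    # Toy stand-in for the real generate_unique_name: the counter lives at
    # module level, so a second conversion continues where the first stopped.
    global _counter
    _counter += 1
    return f"{seed}_{_counter}"

class FakeScaler:
    pass

op = FakeScaler()
print(generate_unique_name("scaler_output"))  # first call prints "scaler_output_1"; later calls keep counting
print(f"scaler_{id(op)}")  # embeds a memory address, so it varies per process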

RhysU commented 2 years ago

Thanks @xadupre. The uses of _generate_unique_name do not explain process-to-process variability because the existing_names set would not be carried across multiple processes. I'm referring to these call sites:

  1. https://github.com/onnx/sklearn-onnx/blob/c730c44fbba7112d4d61a4da76dcb3a5d14c6eee/skl2onnx/common/_topology.py#L671-L680
  2. https://github.com/onnx/sklearn-onnx/blob/c730c44fbba7112d4d61a4da76dcb3a5d14c6eee/skl2onnx/common/_topology.py#L704-L716
  3. https://github.com/onnx/sklearn-onnx/blob/c730c44fbba7112d4d61a4da76dcb3a5d14c6eee/skl2onnx/common/_topology.py#L997-L998

Grepping for id(...) suggests that it is more likely the root cause:

$ ag -Ql 'id(' skl2onnx
skl2onnx/_parse.py
skl2onnx/algebra/complex_functions.py
skl2onnx/algebra/graph_state.py
skl2onnx/algebra/onnx_operator.py
skl2onnx/common/_container.py
skl2onnx/common/_topology.py
skl2onnx/common/utils_sklearn.py
skl2onnx/convert.py
skl2onnx/operator_converters/bagging.py
skl2onnx/operator_converters/calibrated_classifier_cv.py
skl2onnx/operator_converters/grid_search_cv.py
skl2onnx/operator_converters/linear_classifier.py
skl2onnx/operator_converters/linear_regressor.py
skl2onnx/operator_converters/one_vs_rest_classifier.py
skl2onnx/operator_converters/pipelines.py
skl2onnx/operator_converters/stacking.py
skl2onnx/sklapi/woe_transformer_onnx.py

It's probably worth auditing and removing id(...) usage; everything outside of the logger.debug(...) statements feels suspect. Per https://docs.python.org/3/library/functions.html#id, "Two objects with non-overlapping lifetimes may have the same id() value," so it is not as unique as one might hope even within a single CPython process.
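The reuse the docs warn about is easy to observe in practice; a quick demonstration (CPython allocator behavior, not a language guarantee):

a = id(object())  # the temporary object is freed as soon as id() returns...
b = id(object())  # ...so the next allocation frequently reuses the same slot
print(a == b)     # often True in CPython; never rely on it either way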

xadupre commented 2 years ago

In our case, we assume the pipeline remains alive until the conversion ends, so the use of id should not produce any conflicts. But I agree, id(.) should be removed.

RhysU commented 2 years ago

I'm not sure on what timeline I'd be able to get to it, but would you entertain merge requests that gradually remove id(...)?
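Purely as a sketch of one possible direction (nothing agreed in this thread, and the class name is invented): names could come from a per-conversion registry with a deterministic counter, so id(obj) is at most a lookup key during the conversion's lifetime and never leaks into the serialized graph.

class NameRegistry:
    # Hypothetical registry created fresh for each convert_sklearn(...) call.
    def __init__(self):
        self._by_id = {}
        self._counter = 0

    def name_for(self, obj, prefix):
        # id(obj) is only a key while obj is alive; it never appears in the name.
        key = id(obj)
        if key not in self._by_id:
            self._counter += 1
            self._by_id[key] = f"{prefix}_{self._counter}"
        return self._by_id[key]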

xadupre commented 2 years ago

Sure.