sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

`jupyter-keras-tract-tf2` - different inference result for tract and tensorflow #1178

Closed LukeChannings closed 1 year ago

LukeChannings commented 1 year ago

I have not been able to reproduce the same inference result between the Python TensorFlow model and tract.

I'm using the pre-trained models jupyter-keras-tract-tf2/example.onnx and jupyter-keras-tract-tf2/my_model.

In Python:

from numpy import array
from tensorflow.keras.models import load_model

X = array([0.66092634, 0.53683096, 0.087038696, 0.7127285, 0.05790156, 0.6064996, 0.55041677, 0.92463154, 0.90465206, 0.21869498, 0.88987905, 0.2901256, 0.92607194, 0.26502395, 0.22824293, 0.51504177, 0.29867214, 0.26852566, 0.20949501, 0.08402729, 0.8089553, 0.011737704, 0.6567144, 0.71441704, 0.7048285, 0.88621587, 0.19271958, 0.20212299, 0.43037504, 0.70932204, 0.005234003, 0.4231646, 0.61477447, 0.5878512, 0.8295748, 0.42726552, 0.2863956, 0.8885641, 0.54117906, 0.38124472, 0.37213033, 0.8387098, 0.59386516, 0.41340268, 0.10855943, 0.5254563, 0.50082266, 0.5789962, 0.6250877, 0.39174867, 0.13295609, 0.03086859, 0.50527656, 0.40591103, 0.549339, 0.14018834, 0.8138606, 0.30564308, 0.56209683, 0.4143976, 0.78644603, 0.7499952, 0.45428842, 0.05267471, 0.44018543, 0.09318435, 0.8334928, 0.74107134, 0.14123714, 0.21377355, 0.85706604, 0.813123, 0.4717164, 0.5889254, 0.75140196, 0.729333, 0.9936066, 0.3789705, 0.31135952, 0.66729045, 0.892042, 0.57193124, 0.59434617, 0.38983184, 0.34458542, 0.77911943, 0.5056245, 0.59427005, 0.95920986, 0.5499504, 0.88092035, 0.9810167, 0.27594292, 0.7326084, 0.613173, 0.06695807, 0.63387024, 0.31066996, 0.5380825, 0.044705212])
X = X.reshape((1, 100))
new_model = load_model("my_model")

predictions = new_model.predict(X)

print(f"result: {predictions}")

Which outputs:

result: [[0.84874904]]
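One subtlety worth ruling out when comparing the two pipelines: numpy builds `X` as float64 by default, while the Rust code feeds `f32`. A minimal stdlib sketch of that precision gap (illustrative only, not from the original report):

```python
import struct

# Round-tripping a Python float (float64) through float32 loses precision,
# mirroring the dtype difference between the numpy input and the Rust f32 input.
x64 = 0.66092634
x32 = struct.unpack("f", struct.pack("f", x64))[0]
assert x32 != x64             # the float32 value differs...
assert abs(x32 - x64) < 1e-6  # ...but only around the 7th decimal place
```

A drift of this size cannot explain the 0.848 vs 0.479 gap, but casting `X` with `dtype=np.float32` removes it as a variable.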

And in Rust:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        // load the model
        .model_for_path("./example.onnx")?
        // make the model runnable
        .into_runnable()?;

    // Generate some input data for the model
    let x: [f32; 100] = [ 0.66092634, 0.53683096, 0.087038696, 0.7127285, 0.05790156, 0.6064996, 0.55041677, 0.92463154, 0.90465206, 0.21869498, 0.88987905, 0.2901256, 0.92607194, 0.26502395, 0.22824293, 0.51504177, 0.29867214, 0.26852566, 0.20949501, 0.08402729, 0.8089553, 0.011737704, 0.6567144, 0.71441704, 0.7048285, 0.88621587, 0.19271958, 0.20212299, 0.43037504, 0.70932204, 0.005234003, 0.4231646, 0.61477447, 0.5878512, 0.8295748, 0.42726552, 0.2863956, 0.8885641, 0.54117906, 0.38124472, 0.37213033, 0.8387098, 0.59386516, 0.41340268, 0.10855943, 0.5254563, 0.50082266, 0.5789962, 0.6250877, 0.39174867, 0.13295609, 0.03086859, 0.50527656, 0.40591103, 0.549339, 0.14018834, 0.8138606, 0.30564308, 0.56209683, 0.4143976, 0.78644603, 0.7499952, 0.45428842, 0.05267471, 0.44018543, 0.09318435, 0.8334928, 0.74107134, 0.14123714, 0.21377355, 0.85706604, 0.813123, 0.4717164, 0.5889254, 0.75140196, 0.729333, 0.9936066, 0.3789705, 0.31135952, 0.66729045, 0.892042, 0.57193124, 0.59434617, 0.38983184, 0.34458542, 0.77911943, 0.5056245, 0.59427005, 0.95920986, 0.5499504, 0.88092035, 0.9810167, 0.27594292, 0.7326084, 0.613173, 0.06695807, 0.63387024, 0.31066996, 0.5380825, 0.044705212, ];

    let input = tract_ndarray::arr1(&x)
        .into_shape((1, 100))
        .unwrap()
        .into_tensor();

    // Input the generated data into the model
    let result = model.run(tvec![input.into()]).unwrap();
    let to_show = result[0].to_array_view::<f32>()?;
    println!("result: {to_show:?}");
    Ok(())
}

Which outputs:

result: [[0.47968265]], shape=[1, 1], strides=[1, 1], layout=CFcf (0xf), dynamic ndim=2

Am I doing the inference incorrectly?

I have trained the model myself using the same example and also get inconsistent results with the same input data.


kali commented 1 year ago

Hey @LukeChannings, thanks for the feedback. I revisited the example, and chose to dramatically simplify it (discarding jupyter in favor of a simple python script), while covering more tract usage and possibilities (specifically the python bindings). Please have a look at https://github.com/sonos/tract/tree/main/examples/keras-tract-tf2 now, and tell me if it makes more sense.

LukeChannings commented 1 year ago

Hey @kali, thanks for this.

I ran it and it works as expected.

I'm running everything on an M1 Mac (unconventional for ML work I know), and have discovered that the culprit for my broken output was the tensorflow-metal package being installed.

Below is my terminal output: I run without tensorflow-metal, then install tensorflow-metal and run again, and get an AssertionError on the results, which I also print to the console.

I'm not sure if this is your problem or theirs, but thought it was worth feeding back.

(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ python ./example.py
Epoch 1/10
32/32 [==============================] - 0s 590us/step - loss: 3.3521 - accuracy: 0.4790
Epoch 2/10
32/32 [==============================] - 0s 564us/step - loss: 1.4683 - accuracy: 0.5010
Epoch 3/10
32/32 [==============================] - 0s 535us/step - loss: 1.4151 - accuracy: 0.5120
Epoch 4/10
32/32 [==============================] - 0s 453us/step - loss: 1.0922 - accuracy: 0.5040
Epoch 5/10
32/32 [==============================] - 0s 449us/step - loss: 1.0056 - accuracy: 0.4940
Epoch 6/10
32/32 [==============================] - 0s 431us/step - loss: 0.8705 - accuracy: 0.5140
Epoch 7/10
32/32 [==============================] - 0s 423us/step - loss: 0.8001 - accuracy: 0.5170
Epoch 8/10
32/32 [==============================] - 0s 428us/step - loss: 0.7715 - accuracy: 0.5140
Epoch 9/10
32/32 [==============================] - 0s 474us/step - loss: 0.7407 - accuracy: 0.5160
Epoch 10/10
32/32 [==============================] - 0s 756us/step - loss: 0.7151 - accuracy: 0.5320
Could not search for non-variable resources. Concrete function internal representation may have changed.
1/1 [==============================] - 0s 32ms/step
[[0.49810514]] [[0.49810514]]
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ poetry add tensorflow-metal
Using version ^1.0.1 for tensorflow-metal

Updating dependencies
Resolving dependencies... (0.1s)

Package operations: 1 install, 0 updates, 0 removals

  • Installing tensorflow-metal (1.0.1)

Writing lock file
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ poetry show tensorflow
 name         : tensorflow                                                            
 version      : 2.13.0                                                                
 description  : TensorFlow is an open source machine learning framework for everyone. 

dependencies
 - absl-py >=1.0.0
 - astunparse >=1.6.0
 - flatbuffers >=23.1.21
 - gast >=0.2.1,<=0.4.0
 - google-pasta >=0.1.1
 - grpcio >=1.24.3,<2.0
 - h5py >=2.9.0
 - keras >=2.13.1,<2.14
 - libclang >=13.0.0
 - numpy >=1.22,<=1.24.3
 - opt-einsum >=2.3.2
 - packaging *
 - protobuf >=3.20.3,<4.21.0 || >4.21.0,<4.21.1 || >4.21.1,<4.21.2 || >4.21.2,<4.21.3 || >4.21.3,<4.21.4 || >4.21.4,<4.21.5 || >4.21.5,<5.0.0dev
 - setuptools *
 - six >=1.12.0
 - tensorboard >=2.13,<2.14
 - tensorflow-estimator >=2.13.0,<2.14
 - tensorflow-io-gcs-filesystem >=0.23.1
 - termcolor >=1.1.0
 - typing-extensions >=3.6.6,<4.6.0
 - wrapt >=1.11.0
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ python ./example.py
2023-09-08 13:32:23.190032: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2023-09-08 13:32:23.190064: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 64.00 GB
2023-09-08 13:32:23.190071: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 24.00 GB
2023-09-08 13:32:23.190228: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:23.190262: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Epoch 1/10
2023-09-08 13:32:23.984648: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
32/32 [==============================] - 6s 61ms/step - loss: 5.1759 - accuracy: 0.5060
Epoch 2/10
32/32 [==============================] - 0s 10ms/step - loss: 3.6960 - accuracy: 0.5050
Epoch 3/10
32/32 [==============================] - 0s 9ms/step - loss: 3.4854 - accuracy: 0.4850
Epoch 4/10
32/32 [==============================] - 0s 9ms/step - loss: 3.5844 - accuracy: 0.4980
Epoch 5/10
32/32 [==============================] - 0s 9ms/step - loss: 3.9290 - accuracy: 0.5090
Epoch 6/10
32/32 [==============================] - 0s 9ms/step - loss: 3.2427 - accuracy: 0.4880
Epoch 7/10
32/32 [==============================] - 0s 9ms/step - loss: 3.6768 - accuracy: 0.5010
Epoch 8/10
32/32 [==============================] - 0s 9ms/step - loss: 2.9930 - accuracy: 0.4750
Epoch 9/10
32/32 [==============================] - 0s 9ms/step - loss: 3.1644 - accuracy: 0.4960
Epoch 10/10
32/32 [==============================] - 0s 9ms/step - loss: 3.2300 - accuracy: 0.5020
Could not search for non-variable resources. Concrete function internal representation may have changed.
2023-09-08 13:32:32.353272: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-09-08 13:32:32.353539: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.353552: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.371508: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.371529: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.375233: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-09-08 13:32:32.375402: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.375415: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.454222: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
1/1 [==============================] - 0s 265ms/step
[[0.89284575]] [[-0.2692748]]
Traceback (most recent call last):
  File "/Users/luke/Downloads/tract/examples/keras-tract-tf2/./example.py", line 40, in <module>
    assert(np.allclose(tf_output, tract_output))
AssertionError
(keras-tract-tf2-py3.9) keras-tract-tf2:main* 
kali commented 1 year ago

Both my pro and personal machines are Apple Silicon, so no stigma attached here :) tract even has some Apple Silicon optimized multiplication kernels.

I cannot see any sensible way the presence or absence of tensorflow-metal could alter tract behaviour, as tract is very self-contained. Before I blame tensorflow-metal, maybe you can fix the input (instead of using my random one) and compare the tensorflow results with and without tensorflow-metal? It may be "just" a marginal rounding issue that diverged beyond numpy's default comparison tolerance.
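The tolerance check in question is `numpy.allclose`; its default test can be sketched in plain Python (a mirror for illustration, using the output values from this thread):

```python
def allclose_scalar(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    # mirrors numpy.allclose's default element-wise test: |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

# a marginal rounding divergence would still pass the default tolerance
print(allclose_scalar(0.49810514, 0.49810516))  # True
# the gap observed with tensorflow-metal is far beyond any rounding margin
print(allclose_scalar(0.89284575, -0.2692748))  # False
```

So if the two backends only diverged by floating-point rounding, the example's assertion would still pass.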

LukeChannings commented 1 year ago

It's definitely not a rounding-margin problem: without tensorflow-metal the script outputs [[0.49810514]] (TF inference) and [[0.49810514]] (tract inference), but with tensorflow-metal it outputs [[0.89284575]] and [[-0.2692748]] respectively.

Something must be very broken, since the activation function on the output layer is sigmoid, and -0.2692748 is outside the sigmoid function's range of (0, 1).
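That range argument can be verified directly; a quick sanity check in plain Python (not part of the original report):

```python
import math

def sigmoid(x: float) -> float:
    # the logistic sigmoid maps any real input into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# even extreme inputs stay strictly inside (0, 1)
for x in (-30.0, -1.0, 0.0, 1.0, 30.0):
    assert 0.0 < sigmoid(x) < 1.0

# so a value like -0.2692748 cannot be the output of a sigmoid-activated layer
```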

kali commented 1 year ago

I can only assume the intermediate ONNX file is somehow altered or broken when tensorflow-metal enters the dance. But I don't know enough about tensorflow to help you much there. I would start by looking at the ONNX file (with the tract command line, netron, or something else) and try to figure out what happened.

LukeChannings commented 1 year ago

I'll close this since the original question is answered; I'll open another issue if further investigation implicates tract.

Thanks for your help!