Closed LukeChannings closed 1 year ago
Hey @LukeChannings, thanks for the feedback. I revisited the example, and chose to dramatically simplify it (discarding jupyter in favor of a simple python script), while covering more tract usage and possibilities (specifically the python bindings). Please have a look at https://github.com/sonos/tract/tree/main/examples/keras-tract-tf2 now, and tell me if it makes more sense.
Hey @kali, thanks for this.
I ran it and it works as expected.
I'm running everything on an M1 Mac (unconventional for ML work, I know), and have discovered that the culprit for my broken output was the tensorflow-metal package being installed.
Below is my terminal output. I first run without tensorflow-metal, then I install tensorflow-metal and run again, and I get an AssertionError on the results, which I also print to the console.
I'm not sure if this is your problem or theirs, but thought it was worth feeding back.
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ python ./example.py
Epoch 1/10
32/32 [==============================] - 0s 590us/step - loss: 3.3521 - accuracy: 0.4790
Epoch 2/10
32/32 [==============================] - 0s 564us/step - loss: 1.4683 - accuracy: 0.5010
Epoch 3/10
32/32 [==============================] - 0s 535us/step - loss: 1.4151 - accuracy: 0.5120
Epoch 4/10
32/32 [==============================] - 0s 453us/step - loss: 1.0922 - accuracy: 0.5040
Epoch 5/10
32/32 [==============================] - 0s 449us/step - loss: 1.0056 - accuracy: 0.4940
Epoch 6/10
32/32 [==============================] - 0s 431us/step - loss: 0.8705 - accuracy: 0.5140
Epoch 7/10
32/32 [==============================] - 0s 423us/step - loss: 0.8001 - accuracy: 0.5170
Epoch 8/10
32/32 [==============================] - 0s 428us/step - loss: 0.7715 - accuracy: 0.5140
Epoch 9/10
32/32 [==============================] - 0s 474us/step - loss: 0.7407 - accuracy: 0.5160
Epoch 10/10
32/32 [==============================] - 0s 756us/step - loss: 0.7151 - accuracy: 0.5320
Could not search for non-variable resources. Concrete function internal representation may have changed.
1/1 [==============================] - 0s 32ms/step
[[0.49810514]] [[0.49810514]]
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ poetry add tensorflow-metal
Using version ^1.0.1 for tensorflow-metal
Updating dependencies
Resolving dependencies... (0.1s)
Package operations: 1 install, 0 updates, 0 removals
• Installing tensorflow-metal (1.0.1)
Writing lock file
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ poetry show tensorflow
name : tensorflow
version : 2.13.0
description : TensorFlow is an open source machine learning framework for everyone.
dependencies
- absl-py >=1.0.0
- astunparse >=1.6.0
- flatbuffers >=23.1.21
- gast >=0.2.1,<=0.4.0
- google-pasta >=0.1.1
- grpcio >=1.24.3,<2.0
- h5py >=2.9.0
- keras >=2.13.1,<2.14
- libclang >=13.0.0
- numpy >=1.22,<=1.24.3
- opt-einsum >=2.3.2
- packaging *
- protobuf >=3.20.3,<4.21.0 || >4.21.0,<4.21.1 || >4.21.1,<4.21.2 || >4.21.2,<4.21.3 || >4.21.3,<4.21.4 || >4.21.4,<4.21.5 || >4.21.5,<5.0.0dev
- setuptools *
- six >=1.12.0
- tensorboard >=2.13,<2.14
- tensorflow-estimator >=2.13.0,<2.14
- tensorflow-io-gcs-filesystem >=0.23.1
- termcolor >=1.1.0
- typing-extensions >=3.6.6,<4.6.0
- wrapt >=1.11.0
(keras-tract-tf2-py3.9) keras-tract-tf2:main* λ python ./example.py
2023-09-08 13:32:23.190032: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2023-09-08 13:32:23.190064: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 64.00 GB
2023-09-08 13:32:23.190071: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 24.00 GB
2023-09-08 13:32:23.190228: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:23.190262: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Epoch 1/10
2023-09-08 13:32:23.984648: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
32/32 [==============================] - 6s 61ms/step - loss: 5.1759 - accuracy: 0.5060
Epoch 2/10
32/32 [==============================] - 0s 10ms/step - loss: 3.6960 - accuracy: 0.5050
Epoch 3/10
32/32 [==============================] - 0s 9ms/step - loss: 3.4854 - accuracy: 0.4850
Epoch 4/10
32/32 [==============================] - 0s 9ms/step - loss: 3.5844 - accuracy: 0.4980
Epoch 5/10
32/32 [==============================] - 0s 9ms/step - loss: 3.9290 - accuracy: 0.5090
Epoch 6/10
32/32 [==============================] - 0s 9ms/step - loss: 3.2427 - accuracy: 0.4880
Epoch 7/10
32/32 [==============================] - 0s 9ms/step - loss: 3.6768 - accuracy: 0.5010
Epoch 8/10
32/32 [==============================] - 0s 9ms/step - loss: 2.9930 - accuracy: 0.4750
Epoch 9/10
32/32 [==============================] - 0s 9ms/step - loss: 3.1644 - accuracy: 0.4960
Epoch 10/10
32/32 [==============================] - 0s 9ms/step - loss: 3.2300 - accuracy: 0.5020
Could not search for non-variable resources. Concrete function internal representation may have changed.
2023-09-08 13:32:32.353272: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-09-08 13:32:32.353539: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.353552: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.371508: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.371529: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.375233: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-09-08 13:32:32.375402: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-08 13:32:32.375415: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-08 13:32:32.454222: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
1/1 [==============================] - 0s 265ms/step
[[0.89284575]] [[-0.2692748]]
Traceback (most recent call last):
File "/Users/luke/Downloads/tract/examples/keras-tract-tf2/./example.py", line 40, in <module>
assert(np.allclose(tf_output, tract_output))
AssertionError
(keras-tract-tf2-py3.9) keras-tract-tf2:main*
Both my pro and personal machines are Apple silicon, so no stigma attached here :) tract even has some Apple Silicon-optimized multiplication kernels.
I cannot see any sensible way the presence or absence of tensorflow-metal could alter tract's behaviour, as it is very self-contained. Before I blame tensorflow-metal, maybe you can fix the input instead of using my random one, and compare the tensorflow results with and without tensorflow-metal? It may be "just" a marginal rounding issue that diverged beyond numpy's default comparison tolerance.
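A minimal sketch of that suggestion, using only numpy. It is not the code from example.py: the input shape (1, 100), the seed, and the helper name `report` are assumptions made for illustration. It shows a seeded input that is identical on every run, and a comparison that reports the maximum absolute difference alongside the `np.allclose` verdict (whose defaults are rtol=1e-05 and atol=1e-08), so a rounding-level mismatch can be told apart from a gross one.

```python
import numpy as np

# Seeded generator: every run, with or without tensorflow-metal, gets the same input.
# The shape (1, 100) is hypothetical; use the shape the model actually expects.
rng = np.random.default_rng(seed=0)
fixed_input = rng.random((1, 100), dtype=np.float32)


def report(a, b):
    """Return (max absolute difference, np.allclose verdict with default tolerances)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b))), bool(np.allclose(a, b))


# Output pairs as printed in the two terminal runs above.
print(report([[0.49810514]], [[0.49810514]]))
print(report([[0.89284575]], [[-0.2692748]]))
```

If the reported difference sits near the tolerance, loosening `rtol`/`atol` is reasonable; if it is orders of magnitude larger, the two models are not computing the same function.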
It's definitely not a rounding-margin problem: the run without tensorflow-metal outputs [[0.49810514]] (TF inference) and [[0.49810514]] (Tract inference), but the run with tensorflow-metal outputs [[0.89284575]] and [[-0.2692748]] respectively.
Something must be very broken, since the activation function on the output is sigmoid, and -0.2692748 is outside the sigmoid function's range of (0, 1).
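The range argument can be checked numerically. This is a small illustrative sketch, not part of example.py: the `sigmoid` helper and the sampled inputs are assumptions. It evaluates the standard logistic function 1 / (1 + exp(-x)) over a spread of inputs and confirms every result lies strictly between 0 and 1, which a value of -0.2692748 does not.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, evaluated in float64."""
    x = np.asarray(x, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-x))


# Sample inputs from strongly negative to strongly positive.
logits = np.linspace(-30.0, 30.0, 13)
outputs = sigmoid(logits)

# Every sampled output lies strictly inside (0, 1).
in_range = bool(np.all((outputs > 0.0) & (outputs < 1.0)))
print(in_range)

# The value reported by the tensorflow-metal run fails the same check.
suspect = -0.2692748
print(0.0 < suspect < 1.0)
```

A negative final output therefore suggests the graph being run is missing the sigmoid, or applies something after it, rather than a precision issue.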
I can only assume the intermediate ONNX file is somehow altered or broken when tensorflow-metal enters the dance, but I don't know enough about tensorflow to help you much there. I would start by looking at the ONNX file (with the tract command line, netron, or something else) and try to figure out what happened.
I'll close this since the original question is answered. I'll open another issue if further investigation implicates Tract.
Thanks for your help!
I have not been able to get the same inference result from the Python tensorflow model and from tract.
I'm using the pre-trained models jupyter-keras-tract-tf2/example.onnx and jupyter-keras-tract-tf2/my_model.
In Python:
Which outputs:
And in Rust:
Which outputs:
Am I doing the inference incorrectly?
I have trained the model myself using the same example and also get inconsistent results with the same input data.
Setup defaults: