iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[spirv] Inaccurate TF ConvBert result on Apple M GPUs #9971

Closed PhaneeshB closed 1 year ago

PhaneeshB commented 2 years ago

What happened?

Comparing the results from TensorFlow with the results from SHARK shows a difference larger than the allowed tolerance. The error message is:

 AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 501275 / 512000 (97.9%)
Max absolute difference: 4.017738
Max relative difference: 243874.06
 x: array([[[ 1.43804 , -1.28011 ,  0.285097, ..., -2.139163, -0.606236,
         -1.118984],
        [-1.287601,  0.407412, -0.379824, ...,  0.158129,  1.626622,...
 y: array([[[ 1.871550e+00, -1.336534e+00,  1.800059e-01, ...,
         -1.499211e+00, -8.562328e-01, -1.358510e-01],
        [-1.766060e+00,  6.850528e-01, -2.355140e-01, ...,...
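For readers unfamiliar with the numbers in this message, here is a minimal pure-Python sketch of the element-wise criterion that np.testing.assert_allclose applies: an element mismatches when |actual - desired| > atol + rtol * |desired|. The helper name mismatch_report is hypothetical, and the sample values are only the first three elements shown in the error output above, not the full tensors.

```python
def mismatch_report(actual, desired, rtol=1e-2, atol=1e-3):
    """Return (mismatched, max_abs_diff, max_rel_diff) for two flat sequences.

    Mirrors the assert_allclose criterion: an element is a mismatch when
    |actual - desired| > atol + rtol * |desired|.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, d in zip(actual, desired):
        diff = abs(a - d)
        if diff > atol + rtol * abs(d):
            mismatched += 1
        max_abs = max(max_abs, diff)
        if d != 0.0:
            # Relative difference is taken against the "desired" value,
            # so it blows up wherever desired is close to zero.
            max_rel = max(max_rel, diff / abs(d))
    return mismatched, max_abs, max_rel


# First three elements of x (golden_out) and y (result) from the message above.
golden = [1.43804, -1.28011, 0.285097]
result = [1.871550, -1.336534, 0.1800059]

mismatched, max_abs, max_rel = mismatch_report(golden, result)
print(mismatched, max_abs, max_rel)
```

A very large "Max relative difference" such as the one reported is typically a sign that some desired values are near zero, so the absolute difference and the mismatch ratio are the more telling figures here.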

Steps to reproduce your issue

The error can be reproduced using the following script:

from shark.shark_inference import SharkInference
from shark.shark_downloader import download_tf_model
import numpy as np

if __name__ == "__main__":
    model, func_name, inputs, golden_out = download_tf_model("dbmdz/convbert-base-turkish-cased")

    shark_module = SharkInference(
        model, func_name, device="vulkan", mlir_dialect="mhlo"
    )

    shark_module.compile()
    result = shark_module.forward(inputs)
    np.testing.assert_allclose(golden_out, result, rtol=1e-02, atol=1e-03)

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

  1. VulkanSDK needs to be installed on the system.
  2. IREE built from source code with Vulkan flags enabled is also able to reproduce the error.
  3. To execute with the m1-moltenvk-macos target triple on Apple M2, please make the following change in the SHARK source code file: https://github.com/nod-ai/SHARK/blob/d556c0d6ef8f69b32bc3b2d28165345dd2faf403/shark/iree_utils/vulkan_utils.py#L23

replace: if vulkan_device == "M1": with: if vulkan_device == "M1" or vulkan_device == "M2":
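The intent of that change can be sketched as follows. This is not the actual SHARK code: the function name vulkan_target_triple and the None fallback are assumptions made for illustration; only the "M1"/"M2" device names and the m1-moltenvk-macos triple come from this report.

```python
from typing import Optional


def vulkan_target_triple(vulkan_device: str) -> Optional[str]:
    """Hypothetical helper: map a detected device name to a Vulkan target triple."""
    # Treat M2 the same as M1 so that both select the MoltenVK triple.
    if vulkan_device in ("M1", "M2"):
        return "m1-moltenvk-macos"
    # Assumption: unknown devices get no explicit triple in this sketch.
    return None


print(vulkan_target_triple("M2"))
```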

powderluv commented 2 years ago

please attach a link to the .mlir and iree command line to execute / recreate it

antiagainst commented 2 years ago

+1. It would be much easier for me to look into the issue with an input mlir file. Also as @stellaraccident asked in the other issue, is this specific to M2? (I'd suspect not but need to double check.)

PhaneeshB commented 2 years ago

> please attach a link to the .mlir and iree command line to execute / recreate it

Command :

<PATH TO ..../iree-compile> - --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=<PATH TO ..../iree-lld> --mlir-print-debuginfo --mlir-print-op-on-diagnostic=true --iree-llvm-target-cpu-features=host --iree-mhlo-demote-i64-to-i32=false --iree-flow-demote-i64-to-i32 --iree-vulkan-target-triple=m1-moltenvk-macos --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64

Input MLIR: https://storage.googleapis.com/shark_tank/dbmdz_convbert-base-turkish-cased_tf/dbmdz_convbert-base-turkish-cased_tf.mlir

PhaneeshB commented 2 years ago

@antiagainst We checked and found that, as suspected, this issue is also present on M1 with Vulkan.

antiagainst commented 2 years ago

We are seeing the issue of different results with and without --iree-flow-trace-dispatch-tensors again:

local-task:

1x16x32000xf32=[[0.607319 -1.07316 0.898614 -0.267287 1.78744 -0.263523 1.01242 -0.2313 -2.19909 -2.82577 -2.44984 0.527114 -0.46196 0.275833 -1.16742 -0.420368 ...

vulkan (w/ tracing):

1x16x32000xf32=[[0.607278 -1.0732 0.898576 -0.267251 1.78746 -0.263542 1.01245 -0.231225 -2.19902 -2.82572 -2.44985 0.52708 -0.461983 0.275756 -1.1674 -0.420419 ...

vulkan (w/o tracing):

1x16x32000xf32=[[-0.698902 -1.57589 1.41733 0.851334 1.94573 -0.392987 1.02575 -0.61025 -3.48231 -3.52306 -1.55764 -0.271172 -0.189102 -0.334609 -0.209776 0.191701 

With tracing enabled the result is correct. Last time the problem went away, but I guess we just got lucky. We still need to root-cause it properly.
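To put numbers on the three outputs above, here is a small sketch that compares the first 16 values of each run against the local-task reference. The values are copied from the listings in this comment; the helper name max_abs_diff is hypothetical, and 16 values are of course only a sample of the 1x16x32000 tensor.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


local_task = [0.607319, -1.07316, 0.898614, -0.267287, 1.78744, -0.263523,
              1.01242, -0.2313, -2.19909, -2.82577, -2.44984, 0.527114,
              -0.46196, 0.275833, -1.16742, -0.420368]
vulkan_traced = [0.607278, -1.0732, 0.898576, -0.267251, 1.78746, -0.263542,
                 1.01245, -0.231225, -2.19902, -2.82572, -2.44985, 0.52708,
                 -0.461983, 0.275756, -1.1674, -0.420419]
vulkan_untraced = [-0.698902, -1.57589, 1.41733, 0.851334, 1.94573, -0.392987,
                   1.02575, -0.61025, -3.48231, -3.52306, -1.55764, -0.271172,
                   -0.189102, -0.334609, -0.209776, 0.191701]

traced_diff = max_abs_diff(local_task, vulkan_traced)
untraced_diff = max_abs_diff(local_task, vulkan_untraced)
print(f"with tracing:    {traced_diff:.6f}")
print(f"without tracing: {untraced_diff:.6f}")
```

On this sample the traced Vulkan run stays within ordinary floating-point noise of local-task, while the untraced run differs by more than 1.0 in places.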

benvanik commented 2 years ago

You can try compiling with --iree-stream-partitioning-favor=debug which disables all concurrency and puts a barrier between each dispatch - that'd narrow down whether it was multiple dispatches stomping on each other or something host/device.
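A rough sketch of how the two compilations could be set up side by side for that experiment. The compiler name on PATH, the input file name model.mlir, and the helper build_compile_command are assumptions; the flags are a subset of those quoted earlier in this thread plus the one suggested above, and output options are left out.

```python
import shlex

# Subset of the flags from the command posted earlier in this thread.
BASE_FLAGS = [
    "--iree-input-type=mhlo",
    "--iree-hal-target-backends=vulkan",
    "--iree-vulkan-target-triple=m1-moltenvk-macos",
]


def build_compile_command(compiler, input_path, extra_flags=()):
    """Hypothetical helper: assemble an iree-compile argument list."""
    return [compiler, input_path, *BASE_FLAGS, *extra_flags]


# "iree-compile" and "model.mlir" are placeholders for local paths.
baseline_cmd = build_compile_command("iree-compile", "model.mlir")
debug_cmd = build_compile_command(
    "iree-compile", "model.mlir",
    ["--iree-stream-partitioning-favor=debug"],
)

print(shlex.join(baseline_cmd))
print(shlex.join(debug_cmd))
# Each list could then be passed to subprocess.run(cmd, check=True).
```

If the debug-partitioned module produces correct results while the baseline does not, that would point toward dispatches interfering with each other rather than a host/device issue.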

antiagainst commented 1 year ago

Closing this for now given this is Vulkan on MoltenVK -- we have native Metal support and that's the way forward.