iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[CUDA] Numerical issue with Albert static #9553

Closed powderluv closed 2 years ago

powderluv commented 2 years ago

What happened?

Splitting this out from the CPU issue reported earlier in https://github.com/google/iree/issues/9536

SAVE_TEMPS is at: gs://iree-shared-files/nod-perf/anush/iree_save_temps_albert_cuda/

E       assert True == False
E        +  where False = compare_tensors_tf(<tf.Tensor: shape=(1, 16, 30000), dtype=float32, numpy=
array([[[  6.525624  ,  -0.86272705, -10.089135  , ..., -10.213295  ,
          -6.750144  ,  -6.3927426 ],
        [ -1.5267774 ,  -1.6814287 ,  -0.10579234, ...,  -3.0765162 ,
          -3.1644928 ,  -1.4910223 ],
        [  2.685427  ,   3.933254  ,  -0.57486296, ...,  -6.191433  ,
          -5.443107  ,  -5.698504  ],
        ...,
        [ -6.821984  ,   3.051909  ,  -0.40128312, ...,  -6.001742  ,
          -2.25048   ,   0.5378816 ],
        [ -4.226172  ,   3.037271  ,   1.403521  , ...,  -5.54635   ,
          -3.1418035 ,  -0.06469041],
        [ -5.6892867 ,   3.1330695 ,   0.43856078, ...,  -5.7802396 ,
          -2.7348974 ,   0.13402474]]], dtype=float32)>, array([[[  6.530001  ,  -0.8645317 , -10.071699  , ..., -10.202567  ,
          -6.748335  ,  -6.389909  ],
        [ -1.5185928 ,  -1.6757947 ,  -0.1048108 , ...,  -3.064147  ,
          -3.1670513 ,  -1.489428  ],
        [  2.6783054 ,   3.935619  ,  -0.58204937, ...,  -6.198863  ,
          -5.439238  ,  -5.710302  ],
        ...,
        [ -6.8165307 ,   3.05401   ,  -0.3954903 , ...,  -5.9955688 ,
          -2.251053  ,   0.53927326],
        [ -4.228984  ,   3.0336683 ,   1.4009475 , ...,  -5.5458055 ,
          -3.1387935 ,  -0.05896723],
        [ -5.690269  ,   3.1312835 ,   0.43961346, ...,  -5.775498  ,
          -2.731987  ,   0.13719574]]], dtype=float32))
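To get a feel for the size of the mismatch, here is a minimal sketch that compares the six visible entries of the first row of each array in the output above. The values are copied from that output. The helper names (`max_abs_diff`, `all_close`) and the tolerances are illustrative assumptions; they are not taken from `compare_tensors_tf`, whose actual tolerances are not shown in this issue.

```python
# Visible first-row entries of the two arguments to compare_tensors_tf,
# copied from the failure output. Which backend produced which argument
# is not stated in the log, so the names are deliberately neutral.
first_arg = [6.525624, -0.86272705, -10.089135, -10.213295, -6.750144, -6.3927426]
second_arg = [6.530001, -0.8645317, -10.071699, -10.202567, -6.748335, -6.389909]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def all_close(a, b, rtol, atol):
    """Element-wise check |x - y| <= atol + rtol * |y| (hypothetical helper)."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


if __name__ == "__main__":
    print("max abs diff:", max_abs_diff(first_arg, second_arg))
    # Illustrative tolerances only: a tight check fails, a looser one passes.
    print("rtol=1e-3, atol=1e-3:", all_close(first_arg, second_arg, 1e-3, 1e-3))
    print("rtol=1e-2, atol=1e-3:", all_close(first_arg, second_arg, 1e-2, 1e-3))
```

On these samples the differences sit in the second or third decimal place, so whether the test passes depends entirely on the tolerance the comparison uses.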

Steps to reproduce your issue

On a nod VM image, run:

git clone https://github.com/nod-ai/SHARK.git
cd SHARK
PYTHON=python3.10 VENV_DIR=0617_venv IMPORTER=1 USE_IREE=1 ./setup_venv.sh
source 0617_venv/bin/activate
pytest tank/tf/hf_masked_lm/albert-base-v2_test.py::AlbertBaseModuleTest::test_module_static_gpu

What component(s) does this issue relate to?

Compiler

Version information

No response

Additional context

No response

powderluv commented 2 years ago

I tested all the way back to https://github.com/google/iree/releases/tag/candidate-20220604.159 and it fails the same way.

allieculp commented 2 years ago

@ThomasRaoux any update here to share in tomorrow's sync?

ThomasRaoux commented 2 years ago

Sorry, I thought I had updated this bug. The problem goes away when TF32 is turned off in TensorFlow, so it comes from the floating-point precision mode rather than from a miscompile. Gaps of this kind are expected with TF32. @powderluv what do you think is the next step here?
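As background on why TF32 produces gaps of this size: TF32 keeps the 8-bit exponent of float32 but only 10 explicit mantissa bits instead of 23. The sketch below emulates that loss in pure Python by zeroing the low 13 mantissa bits of a float32 value. This is an assumption-laden illustration: it truncates rather than rounds, it accumulates in Python doubles rather than in float32 as the hardware does, and the helper names (`to_tf32`, `dot_tf32`) are hypothetical.

```python
import random
import struct


def to_tf32(x: float) -> float:
    """Round x to float32, then zero the 13 low mantissa bits (10 remain).

    Truncation is a simplification; real hardware may round instead.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_tf32(a, b):
    """Dot product with both inputs reduced to emulated TF32 precision."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    # 768 is an arbitrary illustrative length, not taken from the model.
    a = [rng.uniform(-1.0, 1.0) for _ in range(768)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(768)]
    exact = sum(x * y for x, y in zip(a, b))
    print("full precision:", exact)
    print("emulated TF32 :", dot_tf32(a, b))
```

Each input loses up to roughly one part in a thousand of its value, and those errors accumulate across the matrix multiplications in a transformer. For reference, TensorFlow exposes `tf.config.experimental.enable_tensor_float_32_execution(False)` to disable TF32; the thread does not say whether that call is what was used here.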

allieculp commented 2 years ago

@ThomasRaoux from the meeting: we were looking at the deltas here. Please give an update on the issue.

monorimet commented 2 years ago

This is resolved by https://github.com/nod-ai/SHARK/pull/199.

allieculp commented 2 years ago

@powderluv @monorimet can we close?

monorimet commented 2 years ago

Yes, we have found that the problem was not with IREE.