Low accuracy on Arc A380

ip2016 commented 2 years ago

It seems like XPU calculation accuracy deteriorates in the 4th-5th digit after the decimal point on common math operations. Here is the sample code:

import tensorflow as tf

a = tf.random.normal(shape=[10000, 10000], dtype=tf.float32)
b = tf.random.normal(shape=[10000, 10000], dtype=tf.float32)

@tf.function
def run(a, b):
    x1 = tf.nn.relu(a)
    y1 = tf.nn.relu(b)  
    x2 = tf.math.square(x1)
    y2 = tf.math.square(y1)
    x3 = tf.math.scalar_mul(33e-5, x2)
    y3 = tf.math.scalar_mul(33e-5, y2)
    return tf.tensordot(x3, y3, 2)

with tf.device("/XPU:0"):
    print(f"XPU Result: {run(a, b)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {run(a, b)}")

Which yields the following results:

XPU Result: 2.721888542175293
CPU Result: 2.5815889835357666

System: Asrock A380 Ubuntu 22.04 (kernel 5.17.0-1019-oem)
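
For context, float32 addition is not associative, so two devices that add the same products in a different order can return slightly different sums. The following is a minimal pure-Python sketch of that effect, not the actual CPU or XPU kernels: the helper names (to_f32, dot_sequential_f32, dot_pairwise_f32, make_input) are hypothetical, the two reduction orders are chosen only for illustration, and the input size is far smaller than in the report. It is not claimed to explain the size of the gap shown above.

```python
import math
import random
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_sequential_f32(xs, ys):
    """Dot product with a float32 accumulator, added left to right."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def dot_pairwise_f32(xs, ys):
    """Dot product in float32 using a tree-shaped (pairwise) reduction."""
    terms = [to_f32(x * y) for x, y in zip(xs, ys)]
    while len(terms) > 1:
        merged = [to_f32(terms[i] + terms[i + 1])
                  for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            merged.append(terms[-1])  # odd element carries over unchanged
        terms = merged
    return terms[0]


def make_input(n, rng):
    """Mimic relu -> square -> scale by 33e-5 on normal samples, in float32."""
    out = []
    for _ in range(n):
        v = to_f32(max(0.0, rng.gauss(0.0, 1.0)))
        out.append(to_f32(to_f32(33e-5) * to_f32(v * v)))
    return out


rng = random.Random(0)
n = 20_000  # illustrative size; the report above sums 10^8 products
xs, ys = make_input(n, rng), make_input(n, rng)

# float64 reference: each product of two float32 values is exact in float64,
# and math.fsum returns the correctly rounded sum of those products.
reference = math.fsum(x * y for x, y in zip(xs, ys))
sequential = dot_sequential_f32(xs, ys)
pairwise = dot_pairwise_f32(xs, ys)

for name, value in (("sequential", sequential), ("pairwise", pairwise)):
    rel_err = abs(value - reference) / reference
    print(f"float32 {name}: {value!r} (relative error {rel_err:.2e})")
print(f"float64 reference: {reference!r}")
```

Comparing each device against a higher-precision reference, rather than against the other device, makes it easier to tell which result is actually off.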

yiqianglee commented 2 years ago

@ip2016 Thanks for reporting this issue. We will take a look. A first try on another Intel GPU did not reproduce the issue; we will try on an A380 as well.

Tengfei09 commented 2 years ago

@ip2016 May I ask what kind of CPU you're using? We have tested your example on our HW platforms, and the results suggest that your CPU results look a little weird.

ip2016 commented 2 years ago

Hello @Tengfei09

Thanks for your response. Here is my system info:

>oneapi-cli version
v0.2.0-4-g9fef7bf786

>glxinfo -B
name of display: :0
hwconfig key 77 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 78 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 79 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 80 (UNKNOWN_INTEL_HWCONFIG) unhandled!
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) Graphics (DG2) (0x56a5)
    Version: 22.2.0
    Accelerated: yes
    Video memory: 6088MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) Graphics (DG2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.2.0-devel (git-44289c46d9)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

>lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
    CPU family:          6
    Model:               165
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            5
    CPU max MHz:         4300.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5799.77
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx p
                         dpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 mo
                         nitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c 
                         rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid
                          ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves 
                         dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

>lspci -v |grep -A8 VGA
03:00.0 VGA compatible controller: Intel Corporation Device 56a5 (rev 05) (prog-if 00 [VGA controller])
    Subsystem: ASRock Incorporation Device 6004
    Flags: bus master, fast devsel, latency 0, IRQ 144, IOMMU group 1
    Memory at a1000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 4000000000 (64-bit, prefetchable) [size=8G]
    Expansion ROM at a2000000 [disabled] [size=2M]
    Capabilities: <access denied>
    Kernel driver in use: i915
    Kernel modules: i915

ip2016 commented 2 years ago

I did some additional testing and I am still getting inconsistent results sometimes. Here is one case that is reproducible:

import tensorflow as tf

tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]

tf.random.set_seed(11)
c = tf.random.normal([2, 2], 0, 1, tf.float32) 
d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")

[[-1.5229468   0.66954553]
 [-0.64246905  1.4300431 ]], [[ 0.35981855  1.018044  ]
 [-2.029798   -0.7807023 ]]

with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")

XPU Result: 0.32128676772117615
CPU Result: 0.321286678314209

Running the same code with tensorflow-cpu, I'm getting:

CPU Result: 0.32128584384918213

Virtual environment version:

>python --version
Python 3.10.6

>pip freeze
absl-py==1.3.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.1.0
astunparse==1.6.3
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
colorama==0.4.6
contourpy==1.0.5
cycler==0.11.0
Cython==0.29.32
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
dm-tree==0.1.7
entrypoints==0.4
etils==0.9.0
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.8.0
flatbuffers==22.10.26
fonttools==4.38.0
gast==0.4.0
gin-config==0.5.0
google-api-core==2.10.2
google-api-python-client==2.65.0
google-auth==2.13.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.4
grpcio==1.50.0
h5py==3.7.0
httplib2==0.21.0
huggingface-hub==0.10.1
idna==3.4
importlib-resources==5.10.0
intel-extension-for-tensorflow==1.0.0
intel-extension-for-tensorflow-lib==1.0.0.1
ipykernel==6.17.0
ipython==8.6.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.16.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-server==1.21.0
jupyter_client==7.4.4
jupyter_core==4.11.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
kaggle==1.5.12
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==14.0.6
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.1
matplotlib-inline==0.1.6
mistune==2.0.4
nbclassic==0.4.5
nbclient==0.7.0
nbconvert==7.2.3
nbformat==5.7.0
nest-asyncio==1.5.6
notebook==6.5.1
notebook_shim==0.2.0
numpy==1.23.4
oauth2client==4.1.3
oauthlib==3.2.2
opencv-python-headless==4.6.0.66
opt-einsum==3.3.0
packaging==21.3
panda==0.3.1
pandas==1.5.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
portalocker==2.6.0
prometheus-client==0.15.0
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.19.6
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.5
pycparser==2.21
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
python-slugify==6.1.2
pytz==2022.5
PyYAML==6.0
pyzmq==24.0.1
qtconsole==5.3.2
QtPy==2.2.1
regex==2022.9.13
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
sacrebleu==2.3.1
scikit-learn==1.1.3
scipy==1.9.3
Send2Trash==1.8.0
sentencepiece==0.1.97
seqeval==1.2.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.6.0
tabulate==0.9.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-addons==0.18.0
tensorflow-datasets==4.7.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.10.0
tensorflow-model-optimization==0.7.3
tensorflow-text==2.10.0
termcolor==2.0.1
terminado==0.17.0
text-unidecode==1.3
tf-models-official==2.7.0
tf-slim==1.1.0
threadpoolctl==3.1.0
tinycss2==1.2.1
tokenizers==0.13.1
toml==0.10.2
tornado==6.2
tqdm==4.64.1
traitlets==5.5.0
transformers==4.23.1
typeguard==2.13.3
typing_extensions==4.4.0
uritemplate==4.1.1
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
Werkzeug==2.2.2
widgetsnbextension==4.0.3
wrapt==1.14.1
zipp==3.10.0

ip2016 commented 2 years ago

I changed the code a bit so that the random tensors are generated on the CPU. Now the CPU result from the Intel TensorFlow plugin matches the result from tensorflow-cpu, but the XPU result still does not.

with tf.device("/CPU:0"):
    tf.random.set_seed(11)
    c = tf.random.normal([2, 2], 0, 1, tf.float32) 
    d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")

[[-1.5229472   0.66954476]
 [-0.6424697   1.4300429 ]], [[ 0.35981855  1.0180439 ]
 [-2.0297976  -0.7807032 ]]

with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")

XPU Result: 0.3212856948375702
CPU Result: 0.32128584384918213

yiqianglee commented 2 years ago

@ip2016 Good to see your latest result. For floating point, I think this is reasonable: we can't expect bit-for-bit identical results in floating-point arithmetic. Normally we compare floating-point values using a relative tolerance and an absolute tolerance. Here the difference is less than 1e-6, which is reasonable to me.

XPU Result: 0.3212856948375702
CPU Result: 0.32128584384918213
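
The tolerance-based comparison described above can be sketched with the standard library alone. This is a minimal illustration using the two values quoted in this thread; the 1e-6 tolerances follow the comment, while the helper names f32_bits and ulp_distance_f32 are hypothetical and only handle positive finite values.

```python
import math
import struct


def f32_bits(x):
    """Return the IEEE 754 float32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ulp_distance_f32(a, b):
    """Count representable float32 steps between two positive finite values."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("this sketch only handles positive values")
    return abs(f32_bits(a) - f32_bits(b))


xpu = 0.3212856948375702   # XPU result quoted in this thread
cpu = 0.32128584384918213  # CPU result quoted in this thread

abs_diff = abs(xpu - cpu)
rel_diff = abs_diff / abs(cpu)
# math.isclose passes if the difference is within either tolerance.
within_tol = math.isclose(xpu, cpu, rel_tol=1e-6, abs_tol=1e-6)
ulps = ulp_distance_f32(xpu, cpu)

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3e}")
print(f"within rel_tol=1e-6 / abs_tol=1e-6: {within_tol}")
print(f"float32 ULPs apart: {ulps}")
```

Counting the distance in float32 steps (ULPs) gives a scale-independent view of how far apart two single-precision results are, which can be easier to interpret than a raw decimal difference.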

ip2016 commented 2 years ago

@yiqianglee Thanks for your input. The issue I'm facing is a NaN in the loss function when I try a simple BERT fine-tuning project. I suspect this is caused by the "exploding gradient" problem, due to accumulated/amplified accuracy error on the Arc A380.

This is how it runs on CPU (in progress):

Epoch 1/2
 30/459 [>.............................] - ETA: 7:28 - loss: 0.6868 - accuracy: 0.6208

And this is an XPU run (in progress):

Epoch 1/2
2022-11-01 10:23:42.250718: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type XPU is enabled.
 30/459 [>.............................] - ETA: 20:13 - loss: nan - accuracy: 0.7000

I also noticed it runs much slower on GPU.

The "train" code is below (a simple example from Hugging Face):

# %%
import tensorflow as tf

print(tf.config.list_physical_devices())

# %%
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

# %%
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# %%
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(
    optimizer="adam",
    loss=loss,
    metrics=["accuracy"],
)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=2
)

yiqianglee commented 2 years ago

@ip2016 Thanks for reporting. We can reproduce it now and are working on a fix.

Wanzizhu commented 2 years ago

Hi @ip2016, the PR fixing this NaN issue has been merged. Please follow the doc here to rebuild and give it a try.

ip2016 commented 2 years ago

Thanks for the fast fix. After some unsuccessful attempts, I was able to build the package. I'm no longer getting NaN in the loss function. However, the loss value seems to be a bit optimistic most of the time. For the example above I'm getting a loss of ~0.38, while on CPU and on GPU (Google Colab) I'm getting around 0.65.

I have 2 more issues, and I'm not sure whether they are bugs or limitations.

1. For certain datasets I'm getting an "Out of Memory" (OOM) error. I tried to use tf.config.experimental.set_memory_growth, but it didn't seem to work for XPU. Are there any options to overcome this limitation?
2. When fine-tuning BERT with non-padded token vectors, the training time is much longer. The example above runs for about 10 minutes on the Arc A380 (slower than on the i5-10400), and in intel_gpu_top I can see that the GPU is idle at least half of the time. But if I change the line

return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

to

return tokenizer(example["sentence1"], example["sentence2"], truncation=True, padding=True)

it runs for just around 4 minutes. I haven't noticed a difference when running on CPU. On a Colab GPU the non-padded version is even a bit faster, but not by much.

yiqianglee commented 2 years ago

@ip2016 For 1, set_memory_growth doesn't solve the HW limitation. Currently ITEX's allocator allocates almost all of the memory on the HW device, so if you still see "OOM", I believe you have hit the HW upper bound. Have you tried lowering the batch size?

For 2, unpadded sequences cause dynamic shapes for MatMul. It's a known issue that a oneDNN primitive needs to be re-created whenever a different shape comes in. You can confirm this with export DNNL_VERBOSE=2: if you see many "cache miss" entries from the second iteration onward, that is the overhead (primitive creation). If that is the case on your side, then it's a known issue. We are evaluating internally whether we can provide a solution, but for now it remains a known issue.
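
To illustrate why varying input shapes can be costly when kernels are cached per shape, here is a toy model in plain Python. It is only a sketch under stated assumptions and does not reflect how oneDNN or ITEX is actually implemented: the ShapeKeyedCache class and run_epoch function are hypothetical, the sequence lengths are randomly generated rather than taken from the MRPC dataset, and the cache is unbounded. The 459 batches and batch size of 8 follow the training run above, and 768 is the hidden size of bert-base.

```python
import random


class ShapeKeyedCache:
    """Toy stand-in for a cache of compiled kernels keyed by input shape."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, shape):
        if shape in self._entries:
            self.hits += 1
        else:
            # A real backend would pay a one-off creation cost here.
            self.misses += 1
            self._entries[shape] = f"kernel for {shape}"
        return self._entries[shape]


def run_epoch(cache, batch_lengths, batch_size=8, hidden=768):
    """Look up one kernel per batch, keyed on (batch, seq_len, hidden)."""
    for seq_len in batch_lengths:
        cache.get((batch_size, seq_len, hidden))


random.seed(0)
# Hypothetical per-batch sequence lengths when every batch is padded only to
# its own longest sentence (dynamic shapes).
dynamic_lengths = [random.randint(20, 100) for _ in range(459)]
# The same number of batches padded to one fixed length (static shape).
fixed_lengths = [max(dynamic_lengths)] * len(dynamic_lengths)

dynamic_cache = ShapeKeyedCache()
fixed_cache = ShapeKeyedCache()
run_epoch(dynamic_cache, dynamic_lengths)
run_epoch(fixed_cache, fixed_lengths)

print(f"dynamic shapes: {dynamic_cache.misses} misses, {dynamic_cache.hits} hits")
print(f"fixed shape:    {fixed_cache.misses} misses, {fixed_cache.hits} hits")
```

In this model the fixed-length run creates a single entry and reuses it, while the dynamic run creates one entry per distinct sequence length. Padding trades extra computation on padding tokens for fewer distinct shapes, which is consistent with the speed-up reported above when padding=True was added.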

intel / intel-extension-for-tensorflow

Low accuracy on Arc A380 #5