Open ip2016 opened 2 years ago
@ip2016 thanks for reporting this issue. We will have a look. We first tried on another Intel GPU and couldn't reproduce the issue; we will try on an A380 as well.
@ip2016 May I ask what kind of CPU you’re using? We have tested your example on our HW platforms. Results show that your CPU results seem a little weird.
Hello @Tengfei09
Thanks for your response. Here is my system info:
>oneapi-cli version
v0.2.0-4-g9fef7bf786
>glxinfo -B
name of display: :0
hwconfig key 77 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 78 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 79 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 80 (UNKNOWN_INTEL_HWCONFIG) unhandled!
display: :0 screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Intel (0x8086)
Device: Mesa Intel(R) Graphics (DG2) (0x56a5)
Version: 22.2.0
Accelerated: yes
Video memory: 6088MB
Unified memory: yes
Preferred profile: core (0x1)
Max core profile version: 4.6
Max compat profile version: 4.6
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) Graphics (DG2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.2.0-devel (git-44289c46d9)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
>lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 5
CPU max MHz: 4300.0000
CPU min MHz: 800.0000
BogoMIPS: 5799.77
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 1.5 MiB (6 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Srbds: Mitigation; Microcode
Tsx async abort: Not affected
>lspci -v |grep -A8 VGA
03:00.0 VGA compatible controller: Intel Corporation Device 56a5 (rev 05) (prog-if 00 [VGA controller])
Subsystem: ASRock Incorporation Device 6004
Flags: bus master, fast devsel, latency 0, IRQ 144, IOMMU group 1
Memory at a1000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at a2000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
I did some additional testing and am still getting inconsistent results sometimes. Here is one case that is reproducible:
import tensorflow as tf
tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]
tf.random.set_seed(11)
c = tf.random.normal([2, 2], 0, 1, tf.float32)
d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")
[[-1.5229468 0.66954553]
[-0.64246905 1.4300431 ]], [[ 0.35981855 1.018044 ]
[-2.029798 -0.7807023 ]]
with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")
with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")
XPU Result: 0.32128676772117615
CPU Result: 0.321286678314209
Running the same code with tensorflow-cpu, I'm getting:
CPU Result: 0.32128584384918213
Virtual environment version:
>python --version
Python 3.10.6
>pip freeze
absl-py==1.3.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.1.0
astunparse==1.6.3
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
colorama==0.4.6
contourpy==1.0.5
cycler==0.11.0
Cython==0.29.32
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
dm-tree==0.1.7
entrypoints==0.4
etils==0.9.0
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.8.0
flatbuffers==22.10.26
fonttools==4.38.0
gast==0.4.0
gin-config==0.5.0
google-api-core==2.10.2
google-api-python-client==2.65.0
google-auth==2.13.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.4
grpcio==1.50.0
h5py==3.7.0
httplib2==0.21.0
huggingface-hub==0.10.1
idna==3.4
importlib-resources==5.10.0
intel-extension-for-tensorflow==1.0.0
intel-extension-for-tensorflow-lib==1.0.0.1
ipykernel==6.17.0
ipython==8.6.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.16.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-server==1.21.0
jupyter_client==7.4.4
jupyter_core==4.11.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
kaggle==1.5.12
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==14.0.6
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.1
matplotlib-inline==0.1.6
mistune==2.0.4
nbclassic==0.4.5
nbclient==0.7.0
nbconvert==7.2.3
nbformat==5.7.0
nest-asyncio==1.5.6
notebook==6.5.1
notebook_shim==0.2.0
numpy==1.23.4
oauth2client==4.1.3
oauthlib==3.2.2
opencv-python-headless==4.6.0.66
opt-einsum==3.3.0
packaging==21.3
panda==0.3.1
pandas==1.5.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
portalocker==2.6.0
prometheus-client==0.15.0
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.19.6
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.5
pycparser==2.21
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
python-slugify==6.1.2
pytz==2022.5
PyYAML==6.0
pyzmq==24.0.1
qtconsole==5.3.2
QtPy==2.2.1
regex==2022.9.13
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
sacrebleu==2.3.1
scikit-learn==1.1.3
scipy==1.9.3
Send2Trash==1.8.0
sentencepiece==0.1.97
seqeval==1.2.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.6.0
tabulate==0.9.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-addons==0.18.0
tensorflow-datasets==4.7.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.10.0
tensorflow-model-optimization==0.7.3
tensorflow-text==2.10.0
termcolor==2.0.1
terminado==0.17.0
text-unidecode==1.3
tf-models-official==2.7.0
tf-slim==1.1.0
threadpoolctl==3.1.0
tinycss2==1.2.1
tokenizers==0.13.1
toml==0.10.2
tornado==6.2
tqdm==4.64.1
traitlets==5.5.0
transformers==4.23.1
typeguard==2.13.3
typing_extensions==4.4.0
uritemplate==4.1.1
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
Werkzeug==2.2.2
widgetsnbextension==4.0.3
wrapt==1.14.1
zipp==3.10.0
I changed the code a bit, and now the CPU result from the TensorFlow Intel plugin matches the result from tensorflow-cpu, but not the GPU result.
with tf.device("/CPU:0"):
    tf.random.set_seed(11)
    c = tf.random.normal([2, 2], 0, 1, tf.float32)
    d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")
[[-1.5229472 0.66954476]
[-0.6424697 1.4300429 ]], [[ 0.35981855 1.0180439 ]
[-2.0297976 -0.7807032 ]]
with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")
with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")
XPU Result: 0.3212856948375702
CPU Result: 0.32128584384918213
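One plausible reason two float32 backends disagree in the last digits is that they accumulate the four products in a different order. The sketch below emulates float32 rounding with the standard library and sums the products of the printed `c` and `d` values in two orders. It is only an illustration under that assumption: it does not claim to reproduce how the XPU or CPU kernels actually compute `tensordot`, and the helper names (`f32`, `sum_left_to_right`, `sum_pairwise`) are made up for this example.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Values as printed in the comment above (display-rounded float32 values).
c = [-1.5229472, 0.66954476, -0.6424697, 1.4300429]
d = [0.35981855, 1.0180439, -2.0297976, -0.7807032]

# Each element-wise product, rounded to float32.
products = [f32(f32(a) * f32(b)) for a, b in zip(c, d)]


def sum_left_to_right(values):
    """Accumulate sequentially, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def sum_pairwise(values):
    """(v0 + v1) + (v2 + v3): the shape of a simple parallel reduction."""
    return f32(f32(values[0] + values[1]) + f32(values[2] + values[3]))


sequential = sum_left_to_right(products)
pairwise = sum_pairwise(products)
# Float64 reference: same float32 inputs, no intermediate float32 rounding.
reference = sum(f32(a) * f32(b) for a, b in zip(c, d))

print(f"sequential float32: {sequential!r}")
print(f"pairwise float32:   {pairwise!r}")
print(f"float64 reference:  {reference!r}")
```

Whether the two orders actually differ for these particular inputs depends on the rounding at each step; the point is that both stay within a few float32 ULPs of the float64 reference.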
@ip2016, good to see your latest result. For floating point, I think this is reasonable: we can't expect bit-for-bit identical results in floating-point arithmetic. Normally we compare floating-point values using a relative tolerance and an absolute tolerance; here the difference is less than 1e-6, which seems reasonable to me.
XPU Result: 0.3212856948375702 CPU Result: 0.32128584384918213
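A tolerance-based comparison of the two values quoted above can be sketched with the standard library as follows. The tolerance constants are illustrative choices for float32 results, not values prescribed by ITEX or TensorFlow.

```python
import math

# Results quoted in the thread.
xpu_result = 0.3212856948375702
cpu_result = 0.32128584384918213

abs_diff = abs(xpu_result - cpu_result)
rel_diff = abs_diff / max(abs(xpu_result), abs(cpu_result))

# Illustrative tolerances (assumptions for this sketch).
REL_TOL = 1e-5
ABS_TOL = 1e-6

# math.isclose passes if either the relative or the absolute bound holds.
matches = math.isclose(xpu_result, cpu_result, rel_tol=REL_TOL, abs_tol=ABS_TOL)

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3e}")
print(f"within tolerance:    {matches}")
```

An exact `==` comparison would report a mismatch here, while the tolerance check treats the two results as equivalent.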
@yiqianglee Thanks for your input. The issue I'm facing is a NaN in the loss function when I try a simple BERT fine-tuning project. I suspect this is caused by the "exploding gradient problem" due to accumulated/amplified accuracy error on the Arc A380.
This is how it runs on CPU: (in progress)
Epoch 1/2
30/459 [>.............................] - ETA: 7:28 - loss: 0.6868 - accuracy: 0.6208
And this is an XPU run (in progress):
Epoch 1/2
2022-11-01 10:23:42.250718: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type XPU is enabled.
30/459 [>.............................] - ETA: 20:13 - loss: nan - accuracy: 0.7000
I also noticed that it runs much slower on the GPU.
The "train" code is below (a simple example from huggingface):
# %%
import tensorflow as tf
print(tf.config.list_physical_devices())
# %%
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
# %%
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# %%
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(
    optimizer="adam",
    loss=loss,
    metrics=["accuracy"],
)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=2
)
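When chasing a NaN loss like the one in the XPU run above, it can help to pin down the first training step at which the loss stops being finite, so that the CPU and XPU runs can be compared at that step. A minimal standard-library sketch follows; the `history` values are hypothetical and were not taken from this run. Keras also ships a `tf.keras.callbacks.TerminateOnNaN` callback that stops training when the loss becomes NaN.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-batch loss values, for illustration only.
history = [0.71, 0.69, 0.68, float("nan"), float("nan")]

bad_step = first_non_finite_step(history)
if bad_step is None:
    print("all losses are finite")
else:
    print(f"loss became non-finite at step {bad_step}")
```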
@ip2016 Thanks for reporting, we can reproduce it now, working on fix.
Thanks for the fast fix. After some unsuccessful attempts, I was able to build the package. I'm not getting NaN in the loss function anymore. However, the loss value seems to be a bit optimistic most of the time. For the example above I'm getting a loss of ~0.38, while on CPU and on GPU (on Google Colab) I'm getting around 0.65.
I have 2 more issues, and I'm not sure whether they are bugs or limitations.
return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
to:
return tokenizer(example["sentence1"], example["sentence2"], truncation=True, padding=True)
it runs for just around 4 minutes.
I haven't noticed any difference when running it on CPU; on the Colab GPU the non-padded version is even a bit faster, but not by much.
@ip2016
For 1, set_memory_growth doesn't remove the HW limitation. Currently ITEX's allocator allocates almost all of the device memory, so if you still see "OOM", I believe you have hit the HW upper bound. Have you tried lowering the batch size?
For 2, un-padded sequences cause dynamic shapes for MatMul. It's a known issue that a oneDNN primitive needs to be re-created whenever a different shape comes in. You can confirm this with export DNNL_VERBOSE=2: if you see many "cache miss" entries from the second iteration onward, that is the overhead of primitive creation. If that is what you see on your side, it's a known issue; we are evaluating whether we can provide a solution internally, but for now there is no fix.
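To make the dynamic-shape point concrete, here is a toy model of a shape-keyed primitive cache. It is not oneDNN's actual cache: the function names and the batch lengths are hypothetical. It only illustrates why per-batch padding produces many distinct MatMul shapes, while padding to a multiple of a bucket size, or to one fixed length, produces few.

```python
import math
from typing import List, Tuple


def padded_length(seq_len: int, multiple: int) -> int:
    """Round seq_len up to the next multiple of `multiple`."""
    return math.ceil(seq_len / multiple) * multiple


def count_cache_misses(shapes: List[Tuple[int, int]]) -> int:
    """Toy cache: a miss occurs the first time a given shape is seen."""
    seen = set()
    misses = 0
    for shape in shapes:
        if shape not in seen:
            seen.add(shape)
            misses += 1
    return misses


batch_size = 8
# Hypothetical longest-sequence length in each of ten batches.
batch_max_lengths = [37, 52, 41, 66, 58, 39, 47, 71, 44, 63]

# Padding each batch only to its own longest sequence.
dynamic_shapes = [(batch_size, n) for n in batch_max_lengths]
# Padding each batch up to a multiple of 32.
bucketed_shapes = [(batch_size, padded_length(n, 32)) for n in batch_max_lengths]
# Padding every batch to one fixed length.
fixed_shapes = [(batch_size, 128)] * len(batch_max_lengths)

print("dynamic :", count_cache_misses(dynamic_shapes))
print("bucketed:", count_cache_misses(bucketed_shapes))
print("fixed   :", count_cache_misses(fixed_shapes))
```

In the Hugging Face pipeline from this thread, a similar effect might be obtained with the `pad_to_multiple_of` argument of `DataCollatorWithPadding` or with `padding="max_length"` in the tokenizer call, at the cost of computing on more padding tokens; whether that is a net win on the A380 would need to be measured.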
It seems like XPU calculation accuracy deteriorates in the 4th-5th digit after the decimal point on common math operations. Here is the sample code:
Which yields the following results:
System: Asrock A380 Ubuntu 22.04 (kernel 5.17.0-1019-oem)