@Scot-Survivor thanks. From your gavin-small-err.txt log I can see a lot of kernels running on XPU. Do you still see a performance issue, or can this issue be considered resolved?
There are a fair few; from my counts there are 206 on XPU and just over 10k on CPU (I will re-check this in a second).
However, what concerns me is that operations (well, I know it's an algorithm, but for simplicity's sake we'll call it one operation) such as Softmax are being placed on the CPU, which seems wrong to me. I don't see why there wouldn't be a kernel written for such a common set of operations.
I can reproduce this issue on Arc A700-series cards too, not just the data centre cards.
@Scot-Survivor can you try to get a log with "export TF_CPP_MAX_VLOG_LEVEL=1"? This will report the kernel execution device.
Ahh, okay got you, I'll do that now, give me 5 minutes
This is for comparison reference (before doing the environment variable mentioned above) gavin-small-stdout.txt gavin-small-err.txt (This was very verbose so this is only the first 22MB of a 1.3GB file of logs)
This is still reproducible on A700 series also. (with the downgraded versions)
This image shows the number of XPU versus CPU placements. This was much better, at 7% instead of 0.6%; however, crucial operations such as Softmax and the GradientTape backward ops were still left on the CPU, which significantly limits the model's training performance.
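For reference, this is roughly how I counted the placements from the verbose log (just a sketch; the exact substrings depend on the log format, so treat the pattern matching as an assumption):

```python
# Rough sketch of how the XPU vs CPU counts were produced from the verbose log.
# Assumes placement lines mention the assigned device as ".../device:XPU:N"
# or ".../device:CPU:N"; the exact log format may differ.
from collections import Counter

counts = Counter()
with open("gavin-small-err.txt", errors="ignore") as log:
    for line in log:
        if "device:XPU" in line:
            counts["XPU"] += 1
        elif "device:CPU" in line:
            counts["CPU"] += 1

total = sum(counts.values()) or 1
for device, n in counts.items():
    print(f"{device}: {n} ({100 * n / total:.1f}%)")
```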
Can you help to share how to reproduce this issue? Thanks.
To reproduce my specific issue you'd need to clone my repo; however, that's painful, so just try to train a TensorFlow model on an Arc card or an Intel Data Center Max card and log where the operations are being placed.
Specifically, check algorithms such as GradientTape & Softmax. These two are key to deep learning, and I would expect kernels to have been written so that the operations they involve run on accelerators.
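As a starting point, a minimal check along these lines (a sketch, not my actual training code) should show where Softmax and the GradientTape backward ops end up:

```python
# Minimal placement check: with device-placement logging on, the Softmax op
# and the tape's backward ops should be reported on /device:XPU:0 if the
# XPU kernels are available.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

x = tf.random.uniform((8, 128))
w = tf.Variable(tf.random.normal((128, 10)))

with tf.GradientTape() as tape:
    logits = tf.matmul(x, w)
    probs = tf.nn.softmax(logits)              # check where Softmax lands
    loss = tf.reduce_mean(-tf.math.log(probs[:, 0]))

grads = tape.gradient(loss, [w])               # check where the backward ops land
print(grads[0].device)
```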
I am using exactly: TensorFlow 2.12.0 and Keras 2.12
My code: https://github.com/Gavin-Development/GavinTraining, relies heavily on Keras layers and API if that's at all useful to know
@Scot-Survivor our code should have Softmax kernel support, so can you try a simple case to check your environment?
I'll try with TensorFlow MNIST
MNIST is placing it on the XPU; in fact, in my counts plenty of ops are running on the XPU. This is using the exact same conda env, with the same env variables set, as my other script.
So the question is: why is https://github.com/Gavin-Development/GavinTraining/blob/master/train.py not placing the ops on the XPU? mnist-err.txt
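For reference, the MNIST test was essentially the standard Keras example, run with placement logging on (a sketch of what I ran, not the exact script):

```python
# Roughly the MNIST sanity check: standard Keras example with
# device-placement logging enabled, same conda env and env variables
# as the main training script.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
```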
I even tried forcing it with tf.device("/xpu:0") (roughly as in the sketch below), and it would still not use the XPUs.
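What I tried was roughly the following (a sketch; the "/XPU:0" device string is what the Intel extension registers as far as I can tell, and disabling soft placement should make TensorFlow error out on any op that has no XPU kernel instead of silently falling back to the CPU):

```python
# Sketch of the forced-placement attempt. With soft placement disabled,
# TensorFlow raises an error naming the op that lacks an XPU kernel
# rather than quietly running it on the CPU.
import tensorflow as tf

tf.config.set_soft_device_placement(False)   # fail loudly instead of falling back

with tf.device("/XPU:0"):
    x = tf.random.uniform((4, 16))
    y = tf.nn.softmax(x)
    print(y.device)
```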
This shows about 60% of operations on XPU for MNIST, which is good and what we would expect.
Do you know of a command to show utilisation of the Intel GPUs?
Do you use mixed precision in your script? Please try to use bfloat16 instead of float16.
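For example, something like this minimal sketch (assuming the standard Keras mixed-precision API):

```python
# Sketch: switch Keras mixed precision from float16 to bfloat16,
# the variant suggested for Intel XPUs.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
print(tf.keras.mixed_precision.global_policy())   # expect: mixed_bfloat16
```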
No, I left mixed precision off in this case. I run my script with this command:
python3 train.py -dp ../ -dn Tokenizer-3 --model-name BigModel --max-samples=1_000_000 --batch-size=128 --tokenizer-file Tokenizer-3 --epochs 20 --max-seq-length=128 --layers=2 --d-model=128 --heads=4 --dff=512 --dropout=0.01 --save-every 1000
So a pretty small model too
Showing stats with xpu-smi shows that on card 1 memory usage is 86%, which I think is good news, as that must be my TensorFlow process, but it's still weird that so many ops are running on the CPU.
@Scot-Survivor do you have an NV GPU to double-check your model script, since MNIST shows a good result?
Yeah, it's trained fine on a 2060 at home, and on 3080, 3090, 4090, and A100 cards in the cloud. It places those ops on the GPU fine. I'm not sure what's going on.
Can you help to share a log from the NV GPU with TF 2.12 and with "export TF_CPP_MAX_VLOG_LEVEL=1"?
@Scot-Survivor What is the status now? Could you get the training running on XPU?
No, I never managed to get it working.
I'll be testing again soon though, so I'll be sure to give it another go.
I often just wonder if it's my tensorflow code, but I am not doing anything too out there.
@Scot-Survivor Would you please give a simple example to reproduce this issue? So that we can reproduce and investigate it on our side.
Thanks!
I'll see if I can get a basic example that reproduces this in the next few days. I'll also re-test the code to see if it happens to be fixed on latest updates :)
@Scot-Survivor, just FYI,
we have upgraded to TF 2.14 and provided an example of running distributed training on Intel GPUs. Intel® Optimization for Horovod* is available for TensorFlow to run distributed workloads faster and more easily.
Example link - https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/train_horovod
We have also published a video showing how to train a CNN model on multiple Arc™ GPUs (https://www.youtube.com/watch?v=G5JpBpO6o74).
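Roughly, the Horovod example boils down to something like the following (a sketch only; the 'XPU' device-type pinning, model, and launcher details here are assumptions, so please refer to the linked example for the exact code):

```python
# Minimal sketch of multi-XPU data-parallel training with Horovod + Keras.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod rank to one XPU (device type assumed to be "XPU").
xpus = tf.config.list_physical_devices("XPU")
if xpus:
    tf.config.set_visible_devices(xpus[hvd.local_rank()], "XPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
# Legacy optimizer used here for Horovod wrapper compatibility.
optimizer = hvd.DistributedOptimizer(
    tf.keras.optimizers.legacy.SGD(0.01 * hvd.size()))
model.compile(optimizer=optimizer, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(tf.random.uniform((256, 4)), tf.random.uniform((256, 10)),
          epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

This would typically be launched with something like `horovodrun -np 4 python train.py` or an equivalent mpirun command; again, see the example repo for the exact launch instructions.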
I am currently trying to train https://github.com/Gavin-Development/ on 4x Intel Data Center Max cards. However, I am having major issues trying to get TensorFlow to actually use these cards. Enabling placement debugging shows that only 0.6% of all operations are being placed on the XPU (and only on one of the cards at that).
Debug printing:
This tells me that TensorFlow sees: CPU, XPU:0, XPU:1, XPU:2, XPU:3. However, as described above, TensorFlow is not using the XPUs properly. (Trying to fix this is how I discovered https://github.com/intel/intel-extension-for-tensorflow/issues/41.)
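For completeness, this is roughly how I checked which devices TensorFlow enumerates (a sketch):

```python
# Quick check of the devices TensorFlow enumerates with the Intel extension
# loaded; on this machine it reports the CPU plus XPU:0 through XPU:3.
import tensorflow as tf

print(tf.config.list_physical_devices())
print(tf.config.list_logical_devices("XPU"))
```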