intel / intel-extension-for-tensorflow

Intel® Extension for TensorFlow*

Tensorflow not placing operations on XPU #42

Closed Scot-Survivor closed 7 months ago

Scot-Survivor commented 1 year ago

I am currently trying to train https://github.com/Gavin-Development/ on 4x Intel Data Center GPU Max cards. However, I am having major issues getting TensorFlow to actually use these cards. Enabling placement debugging shows that only 0.6% of all operations are being placed on the XPU (and on only one of the XPUs at that).

Debug printing:

tf.config.list_physical_devices()

This tells me that TensorFlow sees CPU, XPU:0, XPU:1, XPU:2, XPU:3. However, as described above, TensorFlow is not using the XPUs properly. (Trying to fix this is how I discovered https://github.com/intel/intel-extension-for-tensorflow/issues/41.)
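For reference, a minimal sketch of this kind of check (assuming intel-extension-for-tensorflow is installed so the XPU plugin loads when TensorFlow is imported; nothing here is taken from the reporter's script):

```python
import tensorflow as tf  # the ITEX plugin registers the XPU devices on import

# Print each op's assigned device to stderr as it runs.
tf.debugging.set_log_device_placement(True)

print(tf.config.list_physical_devices())       # expect CPU plus XPU:0..XPU:3
print(tf.config.list_physical_devices("XPU"))  # only the Intel GPUs

# A trivial op; the placement log should show which device ran MatMul.
a = tf.random.uniform((4, 4))
print(tf.matmul(a, a))
```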

guizili0 commented 1 year ago
  1. Can you try using TensorFlow 2.12, as I suggested in issue 41?
  2. If you still have the issue after moving to TF 2.12, can you share your debug log?
Scot-Survivor commented 1 year ago
  1. I have downgraded.
  2. Logs are attached: gavin-small-err.txt gavin-small-stdout.txt
guizili0 commented 1 year ago

@Scot-Survivor thanks. From your gavin-small-err.txt log I can see a lot of kernels running on the XPU. Are you hitting a performance issue, or can this issue be considered resolved?

Scot-Survivor commented 1 year ago

There are a fair few; by my count there are 206 on the XPU and just over 10k on the CPU (I will re-check this in a second).

However, what concerns me is that operations (well, I know it's an algorithm, but for simplicity's sake we'll call it one operation) such as Softmax are being placed on the CPU, which seems wrong to me. I don't see why there wouldn't be a kernel written for such a common set of operations.

Scot-Survivor commented 1 year ago

I can reproduce this issue on Arc A700-series cards too, not just the data center cards.

guizili0 commented 1 year ago

@Scot-Survivor can you try getting a log with `export TF_CPP_MAX_VLOG_LEVEL=1`? This will report the device each kernel executes on.
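If it is more convenient to set this from inside the script, one way (a sketch, not from the original thread) is to set the variable before TensorFlow is imported, since the C++ runtime reads it when the library loads:

```python
import os

# Equivalent to `export TF_CPP_MAX_VLOG_LEVEL=1` in the shell; must be set
# before the tensorflow import or the C++ runtime will not pick it up.
os.environ["TF_CPP_MAX_VLOG_LEVEL"] = "1"

import tensorflow as tf  # imported only after the variable is in place

# ... run training as usual; stderr will now include per-kernel device logs.
```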

Scot-Survivor commented 1 year ago

Ahh, okay got you, I'll do that now, give me 5 minutes

Scot-Survivor commented 1 year ago

(Screenshot of the placement counts attached.) This is for comparison/reference, before setting the environment variable mentioned above: gavin-small-stdout.txt gavin-small-err.txt (the stderr log was very verbose, so this is only the first 22 MB of a 1.3 GB file of logs).

Scot-Survivor commented 1 year ago

This is still reproducible on the A700 series as well (with the downgraded versions).

Scot-Survivor commented 1 year ago


![image](https://github.com/intel/intel-extension-for-tensorflow/assets/40865296/7753746b-e537-42ca-a1eb-a6b97e49205b)

This image shows the number of XPU versus CPU placements. This was much better, at 7% instead of 0.6%, but crucial operations such as Softmax and GradientTape were still left on the CPU, which significantly limits the model's training ability.

guizili0 commented 1 year ago

Can you help share how to reproduce this issue? Thanks.

Scot-Survivor commented 1 year ago

> Can you help share how to reproduce this issue? Thanks.

To reproduce my issue specifically you'd need to clone my repo, which is painful; instead, just try training a TensorFlow model on an Arc card or an Intel Data Center Max card and log where the operations are being placed.

Specifically, check algorithms such as GradientTape & Softmax. These two are key to AI, and I would expect kernels to have been written so that the operations they involve run on accelerators (a minimal placement check along these lines is sketched at the end of this comment).

I am using exactly TensorFlow 2.12.0 and Keras 2.12.

My code, https://github.com/Gavin-Development/GavinTraining, relies heavily on Keras layers and the Keras API, if that's at all useful to know.
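A minimal sketch of such a placement check (illustrative only, not the author's train.py): with placement logging enabled, both the forward Softmax and the gradient ops produced by GradientTape should appear on the XPU if the corresponding kernels are registered.

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Tiny toy problem, just to generate a forward Softmax and its backward ops.
x = tf.random.uniform((8, 16))
w = tf.Variable(tf.random.normal((16, 4)))
labels = tf.one_hot(tf.random.uniform((8,), maxval=4, dtype=tf.int32), depth=4)

with tf.GradientTape() as tape:
    logits = tf.matmul(x, w)
    probs = tf.nn.softmax(logits)  # forward Softmax
    loss = -tf.reduce_mean(tf.reduce_sum(labels * tf.math.log(probs + 1e-7), axis=-1))

grads = tape.gradient(loss, [w])   # backward ops
print(loss.numpy(), grads[0].shape)
```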

guizili0 commented 1 year ago

@Scot-Survivor our code should have Softmax kernel support, so can you try a simple case to check your environment?
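A simple case along those lines might look like the following (a sketch, assuming the XPU shows up under the /XPU:0 device name as elsewhere in this thread):

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
print(tf.config.list_physical_devices("XPU"))

with tf.device("/XPU:0"):          # pin the op explicitly
    x = tf.random.normal((2, 8))
    y = tf.nn.softmax(x)

print(y)  # the placement log should report Softmax on /device:XPU:0
```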

Scot-Survivor commented 1 year ago

I'll try with the TensorFlow MNIST example.

Scot-Survivor commented 1 year ago

MNIST is placing ops on the XPU; in fact, by my counts plenty of ops are running on the XPU. This is using the exact same conda env, with the same env variables set, as my other script.

So the question is: why is https://github.com/Gavin-Development/GavinTraining/blob/master/train.py not placing the ops on the XPU? mnist-err.txt

I even tried forcing it with tf.device("/xpu:0"), and it still would not use the XPUs any more than before.
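For context, a sketch of that kind of forcing (the model here is a stand-in, not the actual train.py): wrap model construction and training in a device scope, and disable soft placement so TensorFlow raises an error instead of silently falling back to CPU when an op has no XPU kernel, which can reveal exactly which ops are missing kernels.

```python
import tensorflow as tf

tf.config.set_soft_device_placement(False)   # fail loudly instead of falling back to CPU
tf.debugging.set_log_device_placement(True)

with tf.device("/XPU:0"):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    x = tf.random.normal((256, 32))
    y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
    model.fit(x, y, epochs=1, batch_size=64)
```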

Scot-Survivor commented 1 year ago

(Screenshot attached.) This shows about 60% of operations on the XPU for MNIST, which is good, and what we would expect.

Do you know of a command to show utilisation of the Intel GPUs?

guizili0 commented 1 year ago

Do you use mixed precision in your script? Please try to use bfloat16 instead of float16.
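For reference, a minimal sketch of what switching the policy would look like with the Keras mixed-precision API (only relevant if mixed precision is enabled at all):

```python
import tensorflow as tf

# On Intel GPUs, prefer bfloat16 over float16 for mixed precision.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")  # instead of "mixed_float16"

print(tf.keras.mixed_precision.global_policy())  # compute dtype bfloat16, variable dtype float32
```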

Scot-Survivor commented 1 year ago

> Do you use mixed precision in your script? Please try to use bfloat16 instead of float16.

No, I left mixed precision off in this case. I run my script with this command:

```
python3 train.py -dp ../ -dn Tokenizer-3 --model-name BigModel --max-samples=1_000_000 --batch-size=128 --tokenizer-file Tokenizer-3 --epochs 20 --max-seq-length=128 --layers=2 --d-model=128 --heads=4 --dff=512 --dropout=0.01 --save-every 1000
```

So it's a pretty small model, too.

Scot-Survivor commented 1 year ago

Showing stats with xpu-smi shows that on card 1 memory usage is 86%, which I think is good news, as that must be my TensorFlow process, but it's still weird that so many ops are still running on the CPU.

guizili0 commented 1 year ago

@Scot-Survivor do you have an NV GPU to double-check your model script, since MNIST shows a good result?

Scot-Survivor commented 1 year ago

Yeah, it's trained fine on a 2060 at home, and on 3080, 3090, 4090, and A100 cards in the cloud. It places those ops on the GPU fine. I'm not sure what's going on.

guizili0 commented 1 year ago

Can you help share a log from the NV GPU with TF 2.12 and with `export TF_CPP_MAX_VLOG_LEVEL=1`?

xiguiw commented 10 months ago

@Scot-Survivor What is the status now? Could you get the training running on the XPU?

Scot-Survivor commented 10 months ago

No, I never managed to get it working.

I'll be testing again soon though, so I'll be sure to give it another go.

Scot-Survivor commented 10 months ago

I often just wonder if it's my TensorFlow code, but I am not doing anything too out there.

xiguiw commented 10 months ago

@Scot-Survivor Would you please give a simple example to reproduce this issue? So that we can reproduce and investigate it on our side.

Thanks!

Scot-Survivor commented 10 months ago

> @Scot-Survivor Would you please give a simple example to reproduce this issue? So that we can reproduce and investigate it on our side.
>
> Thanks!

I'll see if I can get a basic example that reproduces this in the next few days. I'll also re-test the code to see if it happens to be fixed with the latest updates :)

yinghu5 commented 9 months ago

@Scot-Survivor, just FYI:

We have upgraded to TF 2.14 and provided an example of running distributed training on Intel GPUs. Intel® Optimization for Horovod* is available for TensorFlow to run distributed workloads faster and more easily. Example link: https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/train_horovod. We have also published a video showing how to train a CNN model on multiple Arc™ GPUs: https://www.youtube.com/watch?v=G5JpBpO6o74
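For orientation, a rough sketch of the general Horovod pattern for multi-XPU training; the linked train_horovod example is the supported reference, and everything below (model, hyperparameters, launch command) is illustrative only:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd  # provided by Intel(R) Optimization for Horovod

hvd.init()

# Pin each worker process to one XPU.
xpus = tf.config.list_physical_devices("XPU")
if xpus:
    tf.config.set_visible_devices(xpus[hvd.local_rank()], "XPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# (legacy optimizer used here for broad Horovod/Keras compatibility).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.legacy.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

x = tf.random.normal((512, 32))
y = tf.random.uniform((512,), maxval=10, dtype=tf.int32)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
model.fit(x, y, epochs=1, batch_size=64, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Such a script would typically be launched with something like `mpirun -np 4 python train_sketch.py` (one process per XPU); see the linked example and video for the exact, supported steps.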