tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Training time for models #70247

Open ThorvaldAagaard opened 1 week ago

ThorvaldAagaard commented 1 week ago

Issue type

Performance

Have you reproduced the bug with TensorFlow Nightly?

N/A

Source

binary

TensorFlow version

2.13

Custom code

N/A

OS platform and distribution

x64

Python version

3.9

Current behavior?

I have a model created using TF 1.x that can be trained on a given dataset in about 8 hours.

Six months ago I converted the model to Keras 2 and noticed that the training time suddenly was more than 16 hours, so I ended up keeping the older 1.x model, but ran it on TF 2.13 with its 1.x compatibility support.

Now I have upgraded the computer used for training to an i9-14900K with 24 cores (from an i9-9900K with 8 cores).

I expected much faster training, but the training is actually taking about 50% longer.

I noticed that my calculated training error now goes down much faster.

[Old model training-error plot]

[New model training-error plot]

So instead of being concerned about the longer training time, should I be happy that I can reduce the number of iterations?

Now I am planning to upgrade to Keras 3, and perhaps it will be better.

Standalone code to reproduce the issue

If this is expected behavior, where can I find documentation of this?

Relevant log output

No response

sushreebarsa commented 4 days ago

@ThorvaldAagaard Please ensure that your environment and TensorFlow version are fully optimized for your new hardware. Also, kindly check whether TensorFlow is correctly utilizing all available cores, and try with the latest TF version. For any further queries, please post this issue in the Keras repository.

Thank you!
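A minimal way to check whether TensorFlow is seeing all available cores could look like the following (a sketch, not from the original report; it only prints what the installed build reports):

```python
# Sketch: inspect what TensorFlow sees on this machine before tuning anything.
import os
import tensorflow as tf

print("Logical CPU cores reported by the OS:", os.cpu_count())
print("Physical devices:", tf.config.list_physical_devices())
# A value of 0 means "let TensorFlow choose a default based on available cores".
print("Intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("Inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())
```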

ThorvaldAagaard commented 4 days ago

Thank you for taking time to respond.

I am currently not using the latest version (2.16) as my scripts are not compatible.

I am using TF 2.13, and my models are still TF 1.x. Should I still make my post on https://github.com/keras-team/keras/issues ?

Since my post I have made some progress by switching to WSL2 and adding TF_ENABLE_ONEDNN_OPTS=1 to my environment, so I am now training faster than on the old platform, as each iteration is down to 2.5 minutes :-)
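For reference, the same flag can also be set from inside the training script, as long as it happens before TensorFlow is imported (a small sketch, not taken from the actual scripts):

```python
# Sketch: enable oneDNN optimizations from within the script.
# The variable must be set before `import tensorflow`, or it has no effect.
import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf  # imported after the env var on purpose
```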

So to show some examples:

1: Running on i9-9900K @ 3.6 GHz with 64 GB RAM - Windows 11 CPU: [CPU utilization screenshot]

2024-06-27 12:20:47 10000. c_train=0.15416838228702545
2024-06-27 12:24:20 20000. c_train=0.13370221853256226
2024-06-27 12:27:55 30000. c_train=0.10825465619564056

This is what I have been used to - a little less than 4 minutes for each iteration.

2: Running on RTX 2070 SUPER - Windows 11 GPU [GPU utilization screenshot]

And I note that this also uses about 25% of the CPU and roughly the same 30 GB of RAM.

2024-06-27 12:33:53 10000. c_train=0.2343929558992386
2024-06-27 12:37:44 20000. c_train=0.18104678392410278
2024-06-27 12:41:15 30000. c_train=0.1590864360332489

So a little bit slower, but OK - it allows me to train 2 models at the same time without the CPU version suffering too much.

3: Running on i9-14900K with 96 GB RAM - Windows 11 CPU: [CPU utilization screenshot]

2024-06-27 13:36:55 10000. c_train=0.150325208902359
2024-06-27 13:39:14 20000. c_train=0.11710051447153091
2024-06-27 13:41:37 30000. c_train=0.09899365156888962

After installation, this was the slow training from the original post. Adding TF_ENABLE_ONEDNN_OPTS has changed it considerably.

This is interesting, as the throughput is better, but CPU utilization is only 66%.

I hope that a document like https://www.intel.com/content/www/us/en/developer/articles/technical/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html can help me get to 100% utilization (see the thread-pool sketch after the list of runs below).

4: Running on RTX 4090 - WSL2 GPU [GPU utilization screenshot]


2024-06-27 14:43:07 10000. c_train=0.2535703182220459
2024-06-27 14:47:09 20000. c_train=0.19858309626579285
2024-06-27 14:51:09 30000. c_train=0.16876137256622314

So slower than the 2070.

5: Running on RTX 4090 - Windows 11 GPU
Not yet tested

6: Running on i9-14900K with 96GB Ram - WSL2 CPU:
Not yet tested
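Regarding the 66% CPU utilization in run 3, the Intel article linked above largely comes down to tuning the intra-op and inter-op thread pools. A hedged sketch for a TF1-style graph running under tf.compat.v1 (the thread counts are placeholders to experiment with, not recommendations):

```python
# Sketch: pin the thread pools for a TF1-style session under TF 2.x.
# The numbers are placeholders to experiment with, not tested recommendations.
import tensorflow as tf

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=8,  # threads used within a single op
    inter_op_parallelism_threads=2,  # ops that may execute concurrently
)
sess = tf.compat.v1.Session(config=config)
```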

So if, based on this, you have any recommendations - for example a link on how to make sure the TensorFlow version is fully optimized for the new hardware - then please advise; if not, just close this as solved.

sushreebarsa commented 3 days ago

@ThorvaldAagaard Thank you for the update! TF v1.x is not actively supported; for migration, please check this reference. It is highly recommended to use the latest version. If you still want to use TF v1.x, then kindly post this issue in the TF Forum, where there is a larger community to get you the right help. Thank you!

ThorvaldAagaard commented 3 days ago

I have tried upgrading to Keras, but noticed a decrease in performance.

The real issue is that I only get 50-60% utilization of my CPU, and I have searched for a solution without luck.

From my search it seems to be a problem I will still have after switching to Keras.

I can see that if I train 2 models at the same time, utilization gets to 75%, but that is still some distance from the expected 100%.

sushreebarsa commented 11 hours ago

@ThorvaldAagaard Could you ensure that data loading and preprocessing are not bottlenecks? You could use multi-threaded data loading and preprocessing with tf.data.Dataset. Please use prefetching and parallelize the data pipeline as much as possible. Also try experimenting with different batch sizes and let us know. Thank you!
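A minimal sketch of that kind of pipeline (the file name, feature shapes and batch size below are placeholders, not taken from this thread):

```python
# Sketch of a parallel, prefetched input pipeline with tf.data.
# Paths, shapes and the parsing logic are placeholders for illustration.
import tensorflow as tf

def parse_example(record):
    # Hypothetical parser; replace with the real decoding of one training record.
    features = tf.io.parse_single_example(
        record,
        {"x": tf.io.FixedLenFeature([8], tf.int64),
         "y": tf.io.FixedLenFeature([8], tf.int64)},
    )
    return features["x"], features["y"]

dataset = (
    tf.data.TFRecordDataset("train.tfrecord")           # placeholder file name
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(512)
    .prefetch(tf.data.AUTOTUNE)                          # overlap input with training
)
```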

ThorvaldAagaard commented 4 hours ago

Remember this is still TF 1.x, so data loading is done using a batcher: https://github.com/lorserker/ben/blob/main/src/batcher.py It could be a bottleneck, but fetching the data for an iteration takes well below a millisecond; one way to verify that is sketched below.
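A rough way to measure that in isolation, assuming some callable that returns one batch (get_batch below is a hypothetical stand-in for whatever the batcher actually exposes):

```python
# Sketch: measure average batch-fetch time in isolation.
# `get_batch` is a hypothetical stand-in for the real batcher call.
import time

def time_batches(get_batch, n=100):
    start = time.perf_counter()
    for _ in range(n):
        get_batch()
    elapsed = time.perf_counter() - start
    print(f"Average batch-fetch time: {elapsed / n * 1000:.3f} ms")
```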

Increasing the batch size seems to lower the total time spent, but even with a very large batch size, CPU usage is still around 50%.

I will try different batch sizes, but the LSTM size and the number of hidden layers also have an impact on performance.

I have been searching for some best practices, but did not really find anything.

My input is 1 million entries, where each entry is a sequence of 8, and the output is then a 1 million × 8 one-hot array.

We started with an LSTM size of 128 and 3 hidden layers, batch size 64 and 50 epochs (giving about 3 million iterations), and a learning rate of 0.0005.

Changing the batch size to 512 reduces the number of iterations to 1/8, but the iteration time only goes up by a factor of 4, so execution time is roughly halved.

But it seems to lower the quality of the net.

Increasing the LSTM size to 256 and even 512 seems to improve the net.

So if there are any guidelines (apart from the trial-and-error approach), they would be welcome.

I am trying to upgrade to Keras 3; unfortunately it is not that simple, but it looks like the way forward.
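For reference, a Keras sketch of the architecture as described above (sequence length 8, LSTM size 128, 3 hidden layers, learning rate 0.0005). The per-step input width, the one-hot output width, the use of dense layers for the "hidden layers", and the choice of Adam are all assumptions, since the thread does not state them:

```python
# Sketch of the described model in Keras. NUM_FEATURES, NUM_CLASSES, the Dense
# hidden layers and the Adam optimizer are assumptions; only the sequence
# length, LSTM size, layer count and learning rate come from the thread.
import tensorflow as tf
from tensorflow import keras

SEQ_LEN = 8        # each entry is a sequence of 8 (from the thread)
NUM_FEATURES = 32  # assumed width of one input step
NUM_CLASSES = 40   # assumed width of the one-hot output per step

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, NUM_FEATURES)),
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0005),
    loss="categorical_crossentropy",
)
model.summary()
```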