Training time on different hardware

drvenabili commented 3 years ago

Hi!

Following on this https://github.com/ltgoslo/simple_elmo_training/issues/3#issuecomment-830621901, please find below training times for different hardware. Perhaps we could have some "expected training time" in the README?

For the record the models are trained on Alvis, Phase 1c.

Hardware:

2x Nvidia A100 40GB
16 cores (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz)
Not sure how much RAM but enough

Software:

This repo (version #33a7e30c997b32ecea8f05b6d506c5c0505d01ac)
Python 3.7.4
GCC/8.3.0
CUDA/10.1.243
OpenMPI/3.1.4
TensorFlow/1.15.2-Python-3.7.4
h5py/2.10.0-Python-3.7.4
scikit-learn/0.21.3-Python-3.7.4

Parameters:

python bilm/train_elmo.py --train_prefix corpus/ --size 1015635151 --vocab_file data/vocab.txt, where
1. vocab.txt has 10,003 words
2. corpus/ contains 134 files, each containing 500k sentences
batch size = 384
dim = 2048
3 epochs

Run times and performance:

256372 seconds for 191200 batches (= 134.085s per 100 batches)
hard to estimate training time per epoch given the job did not finish, but more than 24h as the job was launched for 72
GPU usage at ~66% on average for compute, at 11% on average for memory

Logs are attached. It is now obvious I can really augment the batch size considerably, given the very low memory usage -- it was hard to estimate given nvidia-smi almost always reports 100%.

If you have other ideas I'm interested!

dcgm-gpu-stats-alvis2-21-jobid-67121.txt slurm-67121.txt

akutuzov commented 3 years ago

Thanks!

My first comment is that it is almost always sub-optimal to use TensorFlow 1 and CUDA 10 with A100 GPUs. A100s are intended to be used with CUDA 11 and cuDNN 8. Applications built with CUDA 10 and cuDNN 7 in principle can be run on A100 (by JIT-compiling PTX code into GPU binary code on every run), but this is not recommended by NVIDIA.

You can see this in your logs:

2021-04-30 19:35:10.327492: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_80' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.

This means your TensorFlow and CUDA do not know anything about GPUs with compute capability 8.0 (which is the level of Ampere A100 cards). TensorFlow includes PTX, that's why it works at all. But I suspect it has a heavy impact on performance (and may result in other errors in weird and unexpected locations).

I bet if you run the same code on V100 GPUs, it will be (much) faster.

drvenabili commented 3 years ago

Thanks! Indeed, it's not ideal but according to this https://www.tensorflow.org/install/source#gpu CUDA 11 requires TF 2.4, which isn't supported yet (https://github.com/ltgoslo/simple_elmo_training/commit/33a7e30c997b32ecea8f05b6d506c5c0505d01ac). Unless I'm mistaken?

I've currently re-launched with a bigger batch size on 4 A100s, but I should indeed try with the V100s. I will report in due time.

akutuzov commented 3 years ago

Yes, the training code in this repository currently does not support TF 2. The simple_elmo library for ELMo inference is agnostic to the TF version, but it does not do training.

Updating the training code to TF2 is in my plans, but I cannot give any specific date on when it will be implemented :)

drvenabili commented 3 years ago

Noted, thanks.

Updating the training code to TF2 is in my plans, but I cannot give any specific date on when it will be implemented :)

Please let me know if you need help testing etc.

drvenabili commented 3 years ago

Some update: training on two V100, same hyperparameters and data as above aside from a lower batch size (384) means a total training time of 210474 seconds -- 58.4 hours in total, or 19.4h per epoch.

Thanks!

akutuzov commented 3 years ago

What is the training corpus size in word tokens?

drvenabili commented 3 years ago

Roughly 1B: 1,015,635,151

akutuzov commented 3 years ago

OK, this is in line with my estimates. I guess this issue can be closed then? :)

drvenabili commented 3 years ago

I guess this issue can be closed then? :)

Sure. I was wondering if you wanted to have something in the README.md with training times? Something like a table with hyperparameters and GPUs, to give newcomers an idea of expected training times. Let me know and I can submit a PR

akutuzov commented 3 years ago

Yes, we can do that. My rule of thumb is "one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs and batch size 192", but we can also add your experiences. PRs are most certainly welcome! :)

ltgoslo / simple_elmo_training

Training time on different hardware #6