Thanks!
My first comment is that it is almost always sub-optimal to use TensorFlow 1 and CUDA 10 with A100 GPUs. A100s are intended to be used with CUDA 11 and cuDNN 8. Applications built with CUDA 10 and cuDNN 7 in principle can be run on A100 (by JIT-compiling PTX code into GPU binary code on every run), but this is not recommended by NVIDIA.
You can see this in your logs:
2021-04-30 19:35:10.327492: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_80' is not defined for option 'gpu-name'
Relying on driver to perform ptx compilation. This message will be only logged once.
This means your TensorFlow and CUDA builds do not know anything about GPUs with compute capability 8.0 (the level of the Ampere A100 cards). TensorFlow ships PTX code, which is why it works at all, but I suspect this has a heavy impact on performance (and may cause other errors in weird and unexpected places).
I bet if you run the same code on V100 GPUs, it will be (much) faster.
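As a quick way to verify this on a given machine, here is a minimal sketch (assuming a GPU-enabled TF 1.x installation) that prints the compute capability the runtime reports for each GPU:

```python
# Hedged sketch: check which compute capability the TF runtime actually sees.
# device_lib is an internal but long-standing TF module. On an A100 the
# description should mention "compute capability: 8.0"; if the bundled ptxas
# only knows architectures up to sm_75, kernels for 8.0 are JIT-compiled
# from PTX at startup (as the warning in the logs above indicates).
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        print(dev.physical_device_desc)
```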
Thanks! Indeed, it's not ideal, but according to https://www.tensorflow.org/install/source#gpu, CUDA 11 requires TF 2.4, which isn't supported yet (https://github.com/ltgoslo/simple_elmo_training/commit/33a7e30c997b32ecea8f05b6d506c5c0505d01ac). Unless I'm mistaken?
I've currently re-launched with a bigger batch size on 4 A100s, but I should indeed try with the V100s. I will report in due time.
Yes, the training code in this repository currently does not support TF 2. The simple_elmo library for ELMo inference is agnostic to the TF version, but it does not do training.
Updating the training code to TF2 is in my plans, but I cannot give any specific date on when it will be implemented :)
Noted, thanks.
Please let me know if you need help testing etc.
Some update: training on two V100s with the same hyperparameters and data as above, aside from a lower batch size (384), gives a total training time of 210,474 seconds -- 58.4 hours in total, or 19.4 hours per epoch.
Thanks!
What is the training corpus size in word tokens?
Roughly 1B: 1,015,635,151
OK, this is in line with my estimates. I guess this issue can be closed then? :)
Sure. I was wondering if you wanted to have something in the README.md with training times? Something like a table with hyperparameters and GPUs, to give newcomers an idea of expected training times. Let me know and I can submit a PR.
Yes, we can do that. My rule of thumb is "one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs and batch size 192", but we can also add your experiences. PRs are most certainly welcome! :)
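For illustration, a back-of-the-envelope sketch of that rule of thumb (the only input is the "3 hours per epoch per 100 million tokens on 2x P100" figure quoted above):

```python
# Back-of-the-envelope epoch-time estimate from the rule of thumb above:
# ~3 hours per epoch per 100M word tokens on 2x P100 with batch size 192.
# Illustrative only; real times depend on GPUs, batch size and I/O.
def estimated_epoch_hours(corpus_tokens, hours_per_100m_tokens=3.0):
    return corpus_tokens / 100_000_000 * hours_per_100m_tokens

print(estimated_epoch_hours(1_015_635_151))  # ~30.5 hours per epoch on 2x P100
```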
Hi!
Following up on https://github.com/ltgoslo/simple_elmo_training/issues/3#issuecomment-830621901, please find below the training times for different hardware. Perhaps we could add some "expected training time" section to the README?
For the record, the models were trained on Alvis, Phase 1c.
Hardware:
Software:
Parameters:
python bilm/train_elmo.py --train_prefix corpus/ --size 1015635151 --vocab_file data/vocab.txt
where vocab.txt has 10,003 words and corpus/ contains 134 files, each containing 500k sentences.
Run times and performance:
Logs are attached. It is now obvious that I can increase the batch size considerably, given the very low memory usage -- it was hard to estimate beforehand, since nvidia-smi almost always reports 100% utilisation. If you have other ideas, I'm interested!
dcgm-gpu-stats-alvis2-21-jobid-67121.txt slurm-67121.txt
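If it helps with picking a larger batch size, here is a minimal sketch (assuming nvidia-smi is on the PATH) for logging actual per-GPU memory use during a run, which is more informative than the utilisation column:

```python
# Hedged sketch: poll per-GPU memory use via nvidia-smi while training runs.
# Uses the standard --query-gpu / --format flags; one "used, total" line is
# printed per GPU, values in MiB. Stop with Ctrl-C; adjust the interval as needed.
import subprocess
import time

def gpu_memory_mib():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]
    ).decode()
    return [tuple(int(x) for x in line.split(",")) for line in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        print(gpu_memory_mib())
        time.sleep(60)
```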