
Add support for automatic mixed precision graph rewrites #154

danieldk closed this 4 years ago

danieldk commented 4 years ago

This PR consists of three commits:

I have not had the opportunity to test this PR yet; both GPUs on hopper are in use. I am currently compiling TensorFlow on tesniere with the right compute capabilities.
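
For anyone trying this out: the rewrite itself is exposed in TensorFlow 1.14 as an experimental optimizer wrapper. Below is a minimal sketch of that API only; the toy model, Adam optimizer, and hyperparameters are placeholders, and this is not necessarily how the PR integrates the rewrite into sticker's graph construction.

```python
import numpy as np
import tensorflow as tf

# Ordinary float32 graph; the AMP grappler pass rewrites eligible ops to
# float16 when the graph is optimized at run time.
x = tf.placeholder(tf.float32, [None, 32])
y = tf.placeholder(tf.float32, [None, 1])
hidden = tf.layers.dense(x, 64, activation=tf.nn.relu)
pred = tf.layers.dense(hidden, 1)
loss = tf.losses.mean_squared_error(y, pred)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)

# Enable the automatic mixed precision graph rewrite and wrap the optimizer
# with dynamic loss scaling (available since TF 1.14).
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
    optimizer, loss_scale="dynamic")

train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(16, 32).astype(np.float32)
    batch_y = np.random.rand(16, 1).astype(np.float32)
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

As far as I know, the rewrite only activates on GPUs with compute capability 7.0 or higher (Tensor Cores); on older cards it leaves the graph unchanged.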

twuebi commented 4 years ago

GPU 1 on hopper is free.

danieldk commented 4 years ago

Judging from the output, it seems to work correctly. @twuebi, do you see a big difference in performance? For the default transformer it seems to be a few seconds faster per epoch with TensorFlow 1.14.0.

I briefly tried the precompiled version of TensorFlow 1.15.0; it converted more nodes, but there was no big improvement over 1.14.0.
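
In case it helps with comparing builds: the rewrite can also be enabled through the session config rather than the optimizer wrapper. This is a minimal sketch, assuming the `auto_mixed_precision` toggle in grappler's `RewriterConfig` (stock TF 1.14+); the converted-node counts mentioned above come from the log line the pass typically prints when the graph is first optimized.

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Turn on grappler's auto_mixed_precision pass for every graph run in this
# session (no loss scaling is added this way, only the float16 rewrite).
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = (
    rewriter_config_pb2.RewriterConfig.ON)

# When the graph is optimized, the pass logs something like
# "Converted 123/4567 nodes to float16 precision ..."; comparing that count
# between 1.14 and 1.15 shows how much more the newer pass covers.
with tf.Session(config=config) as sess:
    pass  # run the (already built) training graph as usual
```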

twuebi commented 4 years ago

Besides the difference between the default config with/without AMP (1:49/50 vs. 1:56/57), I don't have any other points of comparison. You could compare larger networks to see whether the difference becomes more significant.

E.g. `--activation relu --outer_hsize 384 --inner_hsize 4092 --keep_prob_inner 0.7 --keep_prob_outer 0.8 --keep_prob_attention 0.8 --keep_prob_input 0.9 --num_layers 6 --num_heads 8`

It's also possible that it allows larger models to fit into memory.

danieldk commented 4 years ago

Ah, you mentioned doing pretraining with mixed precision. I thought you might also have tried without it, to compare the ETAs.

twuebi commented 4 years ago

I briefly tested pretraining without AMP. The default config shows gains similar to regular training (2:00m/epoch -> 1:50m/epoch for training; 9:40h/epoch -> 9:00h/epoch for pretraining). I haven't tried bigger configs without AMP yet, and I also haven't done any profiling yet.

twuebi commented 4 years ago

I didn't observe any differences in accuracy between training with AMP and without it.
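
That is in line with how the rewrite is designed: variables stay in float32 and the optimizer wrapper applies dynamic loss scaling, so small gradients are not flushed to zero in float16. Below is a conceptual sketch of dynamic loss scaling (just the idea, not TF's implementation; the growth interval is an arbitrary illustrative value).

```python
import numpy as np

def loss_scale_step(grads, scale, good_steps, growth_interval=2000):
    """One conceptual dynamic loss-scaling update.

    `grads` are gradients of the *scaled* loss (loss * scale); they are
    unscaled before use. On overflow (inf/NaN) the step is skipped and the
    scale is halved; after `growth_interval` good steps the scale is doubled.
    """
    unscaled = [g / scale for g in grads]
    if any(not np.all(np.isfinite(g)) for g in unscaled):
        return None, scale / 2.0, 0          # skip this update, shrink the scale
    good_steps += 1
    if good_steps >= growth_interval:
        return unscaled, scale * 2.0, 0      # apply update, grow the scale
    return unscaled, scale, good_steps       # apply update, keep the scale
```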