tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

[Feature Request] Mixed Precision #1221

Open mehmedes opened 5 years ago

mehmedes commented 5 years ago

Description

Are there any plans to implement mixed precision? There are already wrappers, like here and here, and support in TensorFlow, as well as an fp16 implementation in BERT.
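
For reference, this is roughly how mixed precision is switched on in current TensorFlow via the Keras mixed precision API (a minimal sketch assuming TF 2.4+; this is not the t2t mechanism discussed in this thread):

import tensorflow as tf

# Run most ops in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(128,)),
    # Keep the final layer in float32 so the softmax/loss stays numerically stable.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Dynamic loss scaling avoids fp16 gradient underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")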

cbockman commented 5 years ago

+1 on this.

(Note that there is bfloat16 support on TPU, but nothing on GPU, which is probably what you're asking for and what I'd also love to play with.)

Would be nice to be able to replicate (+extend) https://arxiv.org/abs/1806.00187. Although I realize the t2t team may be strictly focused on TPUs.

mehmedes commented 5 years ago

#1362

etragas-fathom commented 5 years ago

Hey Mehmedes, I noticed you referenced the pull request we just put up. I just wanted to flag that our PR is only the first step in that direction; we're fairly confident that we've missed something in our implementation. Any thoughts you have would be welcome!

mehmedes commented 5 years ago

Dear etragas-fathom, to be honest I haven't been able to achieve greater speed-ups than your implementation does. It looks like people on Nvidia's end are also struggling with OpenSeq2Seq: https://github.com/NVIDIA/OpenSeq2Seq/issues/270 :_( Moreover, as you mentioned in your pull request, batch size and number of GPUs seem to have an impact: https://github.com/tensorflow/tensorflow/issues/5592 Did I understand you right in the comments that when you run on 8 x V100 you're on par with fairseq?

etragas-fathom commented 5 years ago

Did I understand you right in the comments that when you run on 8 x V100 you're on par with fairseq?

Only if we also enable MirroredStrategy (the downstream implication of turning this flag on): https://github.com/tensorflow/tensor2tensor/blob/acde95f6cea575c1e5009d7a16d95545a23e0552/tensor2tensor/bin/t2t_trainer.py#L72 But we should expect to see a 3x boost from adding fp16 alone, whereas turning on MirroredStrategy results in using all-reduce, which is a whole other beast.
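
For context, this is roughly what the MirroredStrategy path amounts to in plain TF 2.x terms (a sketch only; the t2t trainer wires it up differently via the flag linked above):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(optimizer="adam", loss="mse")
# Per-replica gradients are combined with an all-reduce (NCCL on GPUs), which is
# the extra cost being discussed here.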

mehmedes commented 5 years ago

When running your implementation, what does nvidia-smi say? My Volta is underutilized, running only at 60-70%, whereas CPU usage increases in contrast to fp32 computation. I saw a small increase when using Adafactor with activation_dtype=float16 + weight_dtype=float16, but it's actually not worth mentioning...

BTW, have you had a chance to take a look at diet.py, which already seems to feature fp16 computation in T2T to some extent?

Moreover, in BERT the author of the fp16 implementation states they measured the throughput speed-up by increasing the batch size:

--use_fp16 allows batch_size to be increased from 8 to 16. With both optimizations enabled and batch_size increased to 16, throughput jumps from 7.81 to 30.32 examples/second, a nearly 4x performance boost.

It may be worth comparing the throughput of fp32 maximum batch size vs. the throughput of fp16 maximum batch size!
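
For concreteness, the comparison is just "max batch size times steps/sec" for each precision; a tiny sketch using the numbers from the experiment reported below:

def effective_throughput(batch_size, steps_per_sec):
    # Effective throughput = max batch size * steps/sec (numbers from the table below).
    return batch_size * steps_per_sec

fp32 = effective_throughput(16384, 2.6)   # ~42,598
fp16 = effective_throughput(32768, 1.95)  # ~63,898
print(f"fp16 / fp32 effective speed-up: {fp16 / fp32:.2f}x")  # ~1.50x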

mehmedes commented 5 years ago

Ok. I see. Just ran a quick experiment on the summarization model.

Hardware: 2 * V100 SXM 32GB

PROBLEM=summarize_cnn_dailymail32k
MODEL=transformer
HPARAMS=transformer_tpu (activation_dtype=float16 separately added)

| | FP32 | FP16 | increase |
|---|---|---|---|
| max. batch size | 8,192 * 2 = 16,384 | 16,384 * 2 = 32,768 | x2 |
| step/sec | 2.6 | 1.95 | |
| batch size * step/sec | 42,598 | 63,898 | x1.5 |

You're able to increase the batch size by a factor of 2, but the speed-up is only 1.5x.

cbockman commented 5 years ago

@mehmedes:

Did I understand you right in the comments that when you run on 8 x V100 you're on par with fairseq?

Just to clarify, no, we were/are not on par.

We get a steps/s increase of ~50% (the listed ~2 -> ~3). Fairseq reports throughput increase of 2.6x (and we verified this via running the pytorch repo).

Your comment about increasing the batch size and evaluating effective throughput w/ increased batch size is a good one; we didn't do so because we were trying to replicate the fairseq results precisely (they did test increasing batch size, but only a modest increase of ~40%--oddly, it gave an extremely modest throughput increase of ~3.5%).

Ok. I see. Just ran a quick experiment on the summarization model.

PROBLEM=summarize_cnn_dailymail32k MODEL=transformer HPARAMS=transformer_tpu

Hmm, I may be misunderstanding, but 1) did you run this on our branch, 2) why did you use transformer_tpu? (Did you separately add something to set activation_dtype = float16?)

I saw a small increase when using Adafactor with activation_dtype=float16 + weight_dtype=float16, but it's actually not worth mentioning...

Was this on the branch we linked? Note that we only did a small amount of experimentation with weight_dtype=float16, and it probably isn't properly supported on our branch.

mehmedes commented 5 years ago

Oh yes, I should have mentioned that. I used your branch and separately added activation_dtype=float16 to the TPU setting. When using Adafactor it was possible to use weight_dtype=float16 after changing return activation_dtype == tf.float16 and weight_dtype == tf.float32 to return activation_dtype == tf.float16 and weight_dtype == tf.float16 in def mixed_precision_is_enabled. Your implementation should work; I just can't figure out why there's no speed-up.

The float16 conversion seems to be working, because the maximum batch size can now be doubled!

But I think there is some other issue in general. My V100s are never utilized 100%, in fp16 as well as fp32 training. It's mostly around 60-80%, even in single-GPU training without parameter sharing.

cbockman commented 5 years ago

But I think there is some other issue in general. My V100s are never utilized 100%, in fp16 as well as fp32 training. It's mostly around 60-80%, even in single-GPU training without parameter sharing.

If you're truly motivated, you could pick up fairseq (https://github.com/pytorch/fairseq), run it with fp16, and check out the GPU utilization rates. It's possible those don't show more than 60-80% either.

Separately, as a sanity check--are you sure the weights actually were/are fp16? (Without seeing your diff, I'm not 100% sure what you did.) I ask only because when we changed weights, we very rapidly saw the loss diverge (which is common behavior with straight-fp16, of course, without massaging). (We didn't spend any significant energy resolving this issue, however--always possible it was something trivial or a bug on our side.)
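
For context, the standard mitigation for this kind of fp16 divergence is loss scaling; a minimal manual sketch in TF 2.x style (not what the t2t branch does, and a static scale is used here purely for illustration):

import tensorflow as tf

loss_scale = 1024.0  # static scale for illustration; dynamic scaling adjusts this automatically

def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = loss * loss_scale  # scale up so small fp16 gradients don't underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = [g / loss_scale for g in scaled_grads]  # undo the scaling before the update
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss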

mehmedes commented 5 years ago

When running with Adam, my model diverged at the first step with activation and weight = fp16, as you mentioned. With Adafactor, however, it seems just fine. I used these hparams:


@registry.register_hparams
def transformer_tpu():
  """HParams for Transformer model on TPU."""
  hparams = transformer_base()
  update_hparams_for_tpu(hparams)
  hparams.activation_dtype = "float16"  # added: fp16 activations
  hparams.weight_dtype = "float16"  # added: fp16 weights (pure fp16 rather than mixed)
  return hparams

And changed your code in common_attention.py in L58 to:

return activation_dtype == tf.float16 and weight_dtype == tf.float16
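
For context, the helper being edited presumably looks roughly like this; only the return expression is quoted from this thread, and the signature below is an assumption rather than t2t's exact code:

import tensorflow as tf

def mixed_precision_is_enabled(activation_dtype, weight_dtype):
    # Original check: fp16 activations with fp32 master weights (true mixed precision).
    # The edit above relaxes it to also accept pure-fp16 weights.
    return activation_dtype == tf.float16 and weight_dtype == tf.float16
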
mehmedes commented 5 years ago

With FP16 activation + weight = float16 and Adafactor, the max. batch size can be increased by a factor of 4 vs. FP32! The throughput increases to 85,197 (batch size * step/sec), which is a speed-up by a factor of 2.

PROBLEM=summarize_cnn_dailymail32k
MODEL=transformer
HPARAMS=transformer_tpu (adafactor, activation_dtype + weight_dtype=float16)

| | adafactor + FP32 | adafactor + FP16 activation | adafactor + FP16 activation + weight = float16 |
|---|---|---|---|
| max. batch size | 8,192 * 2 Volta = 16,384 | 16,384 * 2 Volta = 32,768 | 32,768 * 2 Volta = 65,536 |
| step/sec | 2.6 | 1.95 | 1.3 |
| batch size * step/sec | 42,598 | 63,898 | 85,197 |

cbockman commented 5 years ago

1) Note that our PR got merged to master.

2) Did you run to convergence? I.e., do you know that you were getting good loss/learning? I'm not sure what is going on there with adafactor + float16 weights (maybe update clipping is somehow helping it survive?), but there aren't any results I'm familiar with that show comparable pure-fp16 accuracy (versus mixed).

(Highly open to citations, if you are familiar with good results...)

3) Inspired by your notes, we did test simply running a higher batch size. We go from ~8,400 tokens/s (1x V100 16GB, full precision) to ~15,840 tokens/s (mixed precision), a ~1.9x speed-up, which is something of an improvement (at least on the assumption that the LR can be adjusted so that this translates into 1.9x faster wall-clock convergence).

4) There is some possibility that doing something like https://github.com/pytorch/fairseq/commit/03a57decde62c76783ef7e2288bd61bc87f6e266 would push throughput up by another ~20% (since it would allow a greater batch size, and increased batch size seems to only minimally hit steps/s... at least for now).

mehmedes commented 5 years ago
  1. Note that our PR got merged to master.

Yupp, already using that! Thank you!!

  2. Did you run to convergence? I.e., do you know that you were getting good loss/learning? I'm not sure what is going on there with adafactor + float16 weights (maybe update clipping is somehow helping it survive?), but there aren't any results I'm familiar with that show comparable pure-fp16 accuracy (versus mixed). (Highly open to citations, if you are familiar with good results...)

Me neither. Loss/learning actually look good; well, at least, the model doesn't diverge and the training loss and eval loss keep steadily decreasing. T2T seems to be using it for their 1B-param transformer. I know it's bfloat16, but still... https://github.com/tensorflow/tensor2tensor/blob/113bf535b3fd8ab32b0559fbc9aab7798e3dfd2e/tensor2tensor/models/transformer.py#L2397-L2410
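
For comparison, the linked bfloat16 setup presumably amounts to an hparams set along these lines; this is a hedged sketch that mirrors the float16 block above, with a hypothetical name, not the exact code at the link:

@registry.register_hparams
def transformer_bfloat16_sketch():  # hypothetical name, for illustration only
  """Transformer with bfloat16 activations and weights (TPU-oriented sketch)."""
  hparams = transformer_base()
  update_hparams_for_tpu(hparams)
  hparams.activation_dtype = "bfloat16"
  hparams.weight_dtype = "bfloat16"
  return hparams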

  3. Inspired by your notes, we did test simply running a higher batch size. We go from ~8,400 tokens/s (1x V100 16GB, full precision) to ~15,840 tokens/s (mixed precision), a ~1.9x speed-up, which is something of an improvement (at least on the assumption that the LR can be adjusted so that this translates into 1.9x faster wall-clock convergence).

That's something, isn't it!

  4. There is some possibility that doing something like pytorch/fairseq@03a57de would push throughput up by another ~20% (since it would allow a greater batch size, and increased batch size seems to only minimally hit steps/s... at least for now).

That'd be great

mehmedes commented 5 years ago

Inspired by your notes, we did test simply running a higher batch size. We go from ~8,400 tokens/s (1x V100 16GB, full precision) to ~15,840 tokens/s (mixed precision), a ~1.9x speed-up, which is something of an improvement (at least on the assumption that the LR can be adjusted so that this translates into 1.9x faster wall-clock convergence).

In openseq2seq they increased their transformer learning rate by 10: https://github.com/NVIDIA/OpenSeq2Seq/blob/fd35d1cfe53bbd5ed0b423c69c59fdfa6722968f/example_configs/text2text/en-de/transformer-big.py#L49
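
A common heuristic (not necessarily what OpenSeq2Seq or t2t prescribe) is to scale the learning rate roughly in proportion to the increase in effective batch size; the numbers below are illustrative, taken from the tables above, and the base value is not a t2t default:

base_lr = 2.0        # illustrative base learning-rate constant, not a t2t default
fp32_batch = 16384   # fp32 max batch size from the table above
fp16_batch = 65536   # fp16 activation + weight max batch size from the table above

scaled_lr = base_lr * (fp16_batch / fp32_batch)  # linear scaling rule
print(scaled_lr)  # 8.0 with these illustrative numbers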

cbockman commented 5 years ago

In openseq2seq they increased their transformer learning rate by 10:

Hmm, the t2t code base has gone through a lot of iterations, but are you sure this is actually increased by 10? I have a vague recollection that this is what LR looked like on this side, too, before the t2t team changed how they manage LRs.

mehmedes commented 5 years ago

Yes, you're right. I just noticed the change. Thanks!

yangjunpro commented 5 years ago

Folks,

You may try these PRs, which have already been deployed inside Alibaba and have been running for more than half a year. They significantly reduce the laborious work of manually converting fp32 models to fp16, which is both time-consuming and error-prone. Any feedback & comments are highly welcome: Auto-mixed-precision graph optimization pass and Mixed Precision Gradient Decorator.
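
For reference, upstream TensorFlow (1.14+) later shipped a similar automatic rewrite; a minimal sketch of that API, related in spirit to the linked PRs but not the PRs themselves:

import tensorflow as tf

# TF 1.x-style optimizer; the rewrite wraps it with dynamic loss scaling and casts
# eligible ops to fp16 on Volta and newer GPUs, while the model code stays in fp32.
opt = tf.train.AdamOptimizer(learning_rate=1e-4)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(
    opt, loss_scale="dynamic")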

cbockman commented 5 years ago

Thank you @yangjunpro for highlighting this.

Do you see any documented performance advantages (accuracy or speed) vis-a-vis the code which has been pushed to date into t2t? I ask because the first PR in particular is a fairly heavy lift to use, since it requires re-compiling TensorFlow.

yangjunpro commented 5 years ago

Thank you @yangjunpro for highlighting this.

Do you see any documented performance advantages (accuracy or speed) vis-a-vis the code which has been pushed to date into t2t? I ask because the first PR in particular is a fairly heavy lift to use, since it requires re-compiling TensorFlow.

We have an in-house Transformer-based NMT model which gets around a 1.6x speed-up with the same convergence trend. We also believe there is still room for performance improvement: since auto mixed precision is a generic optimization solution, our first focus has been on generality, so that it smoothly supports diverse workloads, and we are currently working on further improving its performance.

sugeeth14 commented 5 years ago

Hi, if a model is trained in fp16 on a Turing GPU and I want to do inference on CPU, is that possible? As far as I can see, PyTorch doesn't support fp16 inference on CPU. Is it the Intel CPUs that don't support fp16 inference, or is it the frameworks like TensorFlow, PyTorch, etc.?
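
Not an authoritative answer, but a common workaround is to cast the fp16-trained weights back to fp32 and run CPU inference in fp32; a minimal sketch with a hypothetical weight dict:

import numpy as np

# Hypothetical stand-in for weights loaded from an fp16-trained checkpoint.
fp16_weights = {"encoder/q_proj/kernel": np.zeros((512, 512), dtype=np.float16)}

# fp32 represents every fp16 value exactly, so the cast is lossless; only the
# memory/speed benefits of fp16 are given up on CPU.
fp32_weights = {name: w.astype(np.float32) for name, w in fp16_weights.items()}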

lkluo commented 5 years ago

I used --hparams_set=transformer_fairseq_fp16_activation_big, and training gets a 1.5x speed-up. However, the training diverges after 8k steps. What could be wrong with my setting?