pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Mixed precision training #1330

Closed mgrankin closed 4 years ago

mgrankin commented 4 years ago

Any plans to introduce mixed precision training as an analog to opt_level O2 in Nvidia/Apex? I'm training a GPT-2 model right now, and it's not training well with XLA_USE_BF16=1: I get perplexity 120 with BF16 versus perplexity 65 with full precision (355M model). When I train on GPU, I see no difference in perplexity between mixed precision and full precision.

One more problem with full precision: the big model (774M) can't be trained in full precision at all, because there is not enough memory on a TPU v3-8. I can fit it in BF16 mode, but it's not training well, as I said earlier.

dlibenzi commented 4 years ago

I think the issue for us is that bfloat16 support on pytorch CPU is incomplete, so even though we can map our BF16 to the pytorch BFloat16 type, we might get into trouble if we hit an unsupported pytorch op which requires us to go to CPU. I will try to wire it up and see what happens.

dlibenzi commented 4 years ago

https://github.com/pytorch/xla/pull/1331

Keep in mind that you will find issues in PT when using bfloat16. For example, this does not work:

t = torch.randn(2, 2, dtype=torch.bfloat16)

Gives:

RuntimeError: _th_normal_ not supported on CPUType for BFloat16

You could get around it by issuing a randn() with float32 and then doing a to(torch.bfloat16), but if you plan to use existing model or layer implementations, you might be looking at a lot of changes.
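
For example, a minimal workaround for the snippet above (create in float32, then cast):

import torch

t = torch.randn(2, 2)       # created as float32 on the CPU
t = t.to(torch.bfloat16)    # then cast to bfloat16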

The above PR just raises our bar so that when PT is bfloat16-ready, we should be as well.

mgrankin commented 4 years ago

I have no issue with PyTorch CPU BFloat16 support. I don't get any error like the one you mentioned above.

My model trains poorly in BFloat16. My guess is that some layers should not be trained in 16-bit mode; the batchnorm layer, for instance, should be kept at 32-bit. Nvidia/Apex does this automatically in O2 mode. Should we expect such a mode for PyTorch/XLA?

dlibenzi commented 4 years ago

Strange that you do not get that error. I am trying on PyTorch HEAD, and I get it there. Maybe you are on the official PT 1.3 release and things got broken in recent commits?

As far as PyTorch/XLA is concerned, it uses the types that come down from PT, aside from the global use-bf16 flag (which is there only because PT did not have bfloat16 before). Making decisions for the user on a layer-by-layer basis does not seem like a good policy. It should be the upper-level model builder that tells the lower SW layer what type tensors should be, because if we bolt a behavior into the lower layers, it is hard to undo for users who do not want to subscribe to it.

mgrankin commented 4 years ago

Deciding for the user which precision each layer should use is exactly what the AMP library does in O2 mode. As a practitioner, I use this feature all the time. There is simply no reason not to: it's one line of code, and your model trains much faster with less memory and the same accuracy.

https://nvidia.github.io/apex/amp.html
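
For reference, enabling it is roughly the following (a sketch following the Apex docs; the tiny Linear model is just a placeholder):

import torch
import torch.nn as nn
from apex import amp  # NVIDIA Apex

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# the "one line" that turns on O2 mixed precision
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling handled by amp
    scaled_loss.backward()
optimizer.step()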

dlibenzi commented 4 years ago

That is exactly what user policy means. The user has to modify their code to inject an "alien" component like AMP, to which the handling of precision is explicitly delegated. It is not that the precision characteristics of a vanilla PyTorch model+loss+optimizer are silently changed from underneath them by a backend.

mgrankin commented 4 years ago

Thank you for clarifying. So automatic mixed precision is out of scope for this project.

Is there any example of how non-automatic mixed precision should be done? For example, how do I convert one existing layer to bf16 while keeping the others in 32-bit precision?

dlibenzi commented 4 years ago

I don't know the details of AMP, but from the way it is called, I am guessing it just iterates over the model's modules and properly changes types according to its internal knowledge of the precision characteristics of the layers in CUDA. Something like that could be built for XLA as well, I guess.

mgrankin commented 4 years ago

Is it possible now to convert one existing layer to bf16 while keeping the others in 32-bit precision?

dlibenzi commented 4 years ago

I'd assume something like this should work. From:

def forward(self, x):
  return op(x)

To:

def forward(self, x):
  x16 = x.to(torch.bfloat16)
  return op(x16).to(torch.float32)
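
One way to apply that to an existing layer without editing its forward is a small wrapper module (just a sketch, not an existing PyTorch/XLA utility; `BF16Wrapper` is a made-up name):

import torch
import torch.nn as nn

class BF16Wrapper(nn.Module):
    "Run the wrapped module in bfloat16 and hand back float32 outputs."
    def __init__(self, module):
        super().__init__()
        self.module = module.to(torch.bfloat16)

    def forward(self, x):
        return self.module(x.to(torch.bfloat16)).to(torch.float32)

# e.g. wrap a single linear layer while the rest of the model stays float32
layer = BF16Wrapper(nn.Linear(16, 16))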

If you are looking for something that recurses into the model's module tree and selectively converts certain nn.Module instances, we do not have that ATM.

mgrankin commented 4 years ago

I'm trying to convert the whole model to bfloat16 and then convert some layers back to fp32.

import torch
import torch.nn as nn

def bhalf(module):
    "Convert all floating-point parameters/buffers of `module` to bfloat16."
    return module._apply(lambda t: t.to(torch.bfloat16) if t.is_floating_point() else t)

def bn2float(module: nn.Module) -> nn.Module:
    "If `module` is batchnorm/LayerNorm, don't use half precision."
    if isinstance(module, (torch.nn.modules.batchnorm._BatchNorm, torch.nn.LayerNorm)):
        module.float()
    for child in module.children():
        bn2float(child)
    return module

def model2half(model: nn.Module) -> nn.Module:
    "Convert `model` to half precision except the batchnorm/LayerNorm layers."
    return bn2float(bhalf(model))

Something like this is done in the fastai library for mixed precision CUDA training.
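
Roughly, the way I'm applying it is the following (a sketch; the tiny Sequential stands in for the real GPT-2 model):

import torch.nn as nn
import torch_xla.core.xla_model as xm

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))  # placeholder for the real model
model = model2half(model)           # bfloat16 conversion on CPU, norm layers kept in fp32
model = model.to(xm.xla_device())   # then move to the TPU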

I'm getting RuntimeError: torch_xla/csrc/tensor_util.cpp:652 : Type not supported: BFloat16 during the conversion. I'm doing the conversion on CPU before moving the model to the TPU, because the model is too big to fit on the TPU. Am I doing it wrong?

mgrankin commented 4 years ago

Actually, I've tried on a smaller model, and the error is the same if I do model2half after moving the model to the TPU.

dlibenzi commented 4 years ago

Are you using the latest nightly wheels/docker images? Mapping PyTorch BFloat16 to XLA BF16 was added a couple of weeks ago.

mgrankin commented 4 years ago

Thank you, I've updated the image and now I get:

RuntimeError: _th_equal not supported on CPUType for BFloat16

dlibenzi commented 4 years ago

Do you have a stack trace? I think the issue there is that there is one pytorch op we do not lower to XLA, so we do our usual dance of pulling the data from the device and running the op using pytorch CPU support ... but pytorch CPU does not support that op for BFloat16. Do you have any aten::* counters in your metrics report when running with plain float32?

mgrankin commented 4 years ago

Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/root/ru_transformers/tpu_lm_finetuning.py", line 692, in main
    train(args, model, tokenizer)
  File "/root/ru_transformers/tpu_lm_finetuning.py", line 413, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 533, in forward
    head_mask=head_mask)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 427, in forward
    hidden_states = self.drop(hidden_states)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 54, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 807, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: _th_equal not supported on CPUType for BFloat16

I didn't create a metrics report; should I create one?

dlibenzi commented 4 years ago

Hmm, we support dropout. Can you run in float32 and print the metrics as below for a couple of steps?

import torch_xla.debug.metrics as met
print(met.metrics_report())

mgrankin commented 4 years ago

metrics.txt

dlibenzi commented 4 years ago

Oh, I see, it's us generating the call. Let me fix that ...

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

d = xm.xla_device()
do = nn.Dropout(0.5).to(d)
x = torch.randn(3, 3).to(torch.bfloat16).to(d)
y = do(x)

#0  0x00007fffe89daafd in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffdf5592d5 in at::native::legacy::cpu::_th_equal (self=..., other=...) at aten/src/ATen/LegacyTHFunctionsCPU.cpp:2440
#2  0x00007fffdf49df1b in at::CPUType::(anonymous namespace)::equal (self=..., other=...) at aten/src/ATen/CPUType.cpp:3167
#3  0x00007fffdf50ef43 in c10::detail::WrapRuntimeKernelFunctor_<bool (*)(at::Tensor const&, at::Tensor const&), bool, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >::operator() (this=0x55555614c720, args=...,
    args=...) at ../aten/src/ATen/core/boxing/kernel_lambda.h:23
#4  0x00007fffdf50ed4d in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<bool (*)(at::Tensor const&, at::Tensor const&), bool, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, bool (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) (functor=0x55555614c720, args=..., args=...) at ../aten/src/ATen/core/boxing/kernel_functor.h:260
#5  0x00007fffdba752c9 in c10::KernelFunction::callUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x5555568c9e70, args=..., args=...)
    at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/boxing/KernelFunction.h:129
#6  0x00007fffdba7522c in c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}::operator()(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&) const (this=0x7fffffff9210, backendFallbackKernels=...)
    at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:189
#7  0x00007fffdba7513e in c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}>(c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}&&) const (
    this=0x7fffe8900608 <c10::Dispatcher::singleton()::_singleton+144>, readFunc=...) at /usr/local/google/home/dlibenzi/google-git/pytorch/c10/util/LeftRight.h:74
#8  0x00007fffdba74e9e in c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x7fffe8900578 <c10::Dispatcher::singleton()::_singleton>, dispatchTable=..., backendFallbackKernels_=..., args=..., args=...)
    at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:186
#9  0x00007fffdba74e3a in c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}::operator()(c10::DispatchTable const&) const (this=0x7fffffff93d8, dispatchTable=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:179
#10 0x00007fffdba74d7e in c10::LeftRight<c10::DispatchTable>::read<c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const (this=0x55555614da78,
    readFunc=...) at /usr/local/google/home/dlibenzi/google-git/pytorch/c10/util/LeftRight.h:74
#11 0x00007fffdba74ae1 in c10::impl::OperatorEntry::readDispatchTable<c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const (this=0x55555614da00,
    functor=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/OperatorEntry.h:32
#12 0x00007fffdba74a9a in c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x7fffe8900578 <c10::Dispatcher::singleton()::_singleton>, op=..., args=..., args=...)
    at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:176
#13 0x00007fffdba749a0 in at::Tensor::equal (this=0x7fffffff98e8, other=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:6198
#14 0x00007fffdba64e4c in torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorComparer::operator() (this=0x555557ceed69, tensor1=..., tensor2=...) at torch_xla/csrc/tensor.cpp:162
#15 0x00007fffdba64df5 in xla::util::Cache<at::Tensor, xla::ComputationClient::Data, torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorHasher, torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorComparer>::Equaler::operator() (this=0x555557ceed69, k1=0x7fffffff98e8, k2=0x555557d096f0) at third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/cache.h:88

dlibenzi commented 4 years ago

With this the above should work:

https://gist.github.com/dlibenzi/4b449de70d95de7460f136c023c6092f

Need to upstream that, or a version of it.

mgrankin commented 4 years ago

I'm not sure what to do with that code.

dlibenzi commented 4 years ago

@mruberry is working on pushing it upstream to pytorch. Once that lands, you should not be getting that error anymore.

mgrankin commented 4 years ago

Any updates?

dlibenzi commented 4 years ago

Sorry for the delay, we will push that soon.

dlibenzi commented 4 years ago

Sent https://github.com/pytorch/pytorch/pull/30817

dlibenzi commented 4 years ago

That PR went in. It should be in our nightly. If not today, for sure tomorrow.

mgrankin commented 4 years ago

Just tried with the updated nightly, same error in the same place.

dlibenzi commented 4 years ago

I have tried the snippet below, and it is working fine for me:

import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

d = xm.xla_device()
do = nn.Dropout(0.5).to(d)
x = torch.randn(3, 3).to(torch.bfloat16).to(d)
y = do(x)

whereas before it was giving that error. Did you update the docker images/wheels?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AdityaSoni19031997 commented 4 years ago

So do we now have automatic layer dtype conversion like "amp" does? Also, we can use gradient accumulation with TPUs, right? Thanks!

dlibenzi commented 4 years ago

We do not have an AMP-like solution ATM. There were talks about having something like that, but they did not move forward given limited resources and higher priorities.

Gradient accumulation does not involve anything TPU-specific and should work.

jysohn23 commented 4 years ago

@AdityaSoni19031997 When doing gradient accumulation, just make sure you call xm.mark_step() after every forward pass, since without it you'll end up with one massive graph with N forwards + 1 backward.

AdityaSoni19031997 commented 4 years ago

Thanks for the heads-up! Was just wondering, do we have an example of this in the colab dir? Otherwise I can hopefully get it working by tomorrow and send a PR. Thanks!

jysohn23 commented 4 years ago

We don't have a colab example, but we do have an example of gradient accumulation at https://github.com/pytorch-tpu/fairseq/blob/tpu/fairseq/trainer.py#L396-L402. To clarify my comment above: you only need to explicitly call xm.mark_step() if your parallel_loader returns a list of N batches on which to run gradient accumulation. If your parallel_loader returns 1 batch at a time and you apply gradient accumulation across the N batches it returns, you don't need to explicitly call mark_step, since that's done for you by the loader.
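
A minimal sketch of the explicit-mark_step variant, assuming the loader hands you a list of N batches per optimizer step (the model, data, and loss here are placeholders):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(10, 2).to(device)             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(batches):
    # `batches` is a list of N (inputs, targets) pairs already on the XLA device
    optimizer.zero_grad()
    for inputs, targets in batches:
        loss = loss_fn(model(inputs), targets) / len(batches)
        loss.backward()
        xm.mark_step()                          # cut the graph after each accumulation step
    xm.optimizer_step(optimizer, barrier=True)  # reduce gradients, step, and flush the last graph

batches = [(torch.randn(4, 10).to(device), torch.randint(0, 2, (4,)).to(device))
           for _ in range(3)]
train_step(batches)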