I think the issue for us is that bfloat16 support on pytorch CPU is incomplete, so even though we can map our BF16 to the pytorch BFloat16 type, we might get into trouble if we hit an unsupported pytorch op which requires us to go to CPU. I will try to wire it up and see what happens.
https://github.com/pytorch/xla/pull/1331
Keep in mind that you will find issues in PT when using bfloat16. For example, this does not work:
t = torch.randn(2, 2, dtype=torch.bfloat16)
Gives:
RuntimeError: _th_normal_ not supported on CPUType for BFloat16
You could get around that by issuing a randn() with float32 and doing a to(bfloat16), but if you plan to use existing model or layer implementations, you might be looking at a lot of changes.
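A minimal sketch of that workaround:
# Create the tensor in float32, then cast, since _th_normal_ has no
# BFloat16 CPU kernel at this point.
import torch

t = torch.randn(2, 2, dtype=torch.float32).to(torch.bfloat16)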
The above PR just raises our bar so that when PT is bfloat16-ready, we should be as well.
I have no issue with PyTorch CPU BFloat16 support. I don't get any error like you mentioned above.
My model trains poorly in BFloat16. My guess is that some layers should not be trained in 16-bit mode; the batchnorm layers, for instance, should be kept at 32-bit. Nvidia/Apex does this automatically in its O2 mode. Should we expect such a mode for PyTorch/XLA?
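For reference, the Apex call I mean is roughly this (a sketch, assuming Apex is installed and a CUDA GPU is available):
# Sketch of Apex automatic mixed precision; opt_level O2 casts the model
# to FP16 but keeps batchnorm layers in FP32 automatically.
import torch
from torch import nn
from apex import amp

model = nn.Sequential(nn.Linear(16, 16), nn.BatchNorm1d(16)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")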
Strange you do not get that error. I am trying on PyTorch HEAD, and I get that. Maybe you are on official PT 1.3 and things got broken in the recent commits?
As far as PyTorch/XLA is concerned, it uses the types that are coming down from PT, aside from the global use-bf16 flag (which is there only because PT did not have bfloat16 before). Making decisions for the user on a layer-by-layer basis does not seem like a good policy. It should be the upper-level model builder that tells the lower SW layer which types the tensors should be, because if we bolt a behavior into the lower layers, it is hard to undo for users who do not want to subscribe to it.
Deciding for the user which layers should run at which precision is exactly what the AMP library does in O2 mode. As a practitioner I use this feature all the time. There is simply no reason not to - it's one line of code and your model trains much faster with less memory and the same accuracy.
That is exactly what user policy means.
The user has to modify their code to inject an "alien" component like amp, to which the handling of precision is explicitly delegated. It is not that the precision characteristics of a vanilla PyTorch model+loss+optimizer are silently changed from underneath them by a backend.
Thank you for clarifying. So automatic mixed precision is out of scope for this project.
Is there any example of how non-automatic mixed precision should be done? For instance, how do I convert one existing layer to bf16 while keeping the others at 32-bit precision?
I don't know the details of AMP, but from the way it is called, I am guessing it just iterates over the model's modules and changes types according to its internal knowledge of the precision characteristics of the layers in CUDA. Something like that could be built for XLA as well, I guess.
Is it possible now to convert one existing layer to bf16 while keeping the others at 32-bit precision?
I'd assume something like this should work. From:
def forward(self, x):
    return op(x)
To:
def forward(self, x):
    x16 = x.to(torch.bfloat16)
    return op(x16).to(torch.float32)
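A more complete (hypothetical) version of that sketch, wrapping a single layer so only it runs in bfloat16:
# Hypothetical wrapper: only this layer runs in bfloat16, everything
# around it stays float32. Whether it actually runs depends on bfloat16
# op coverage on the target device (e.g. an XLA device).
import torch
from torch import nn

class BF16Linear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features).to(torch.bfloat16)

    def forward(self, x):
        x16 = x.to(torch.bfloat16)
        return self.linear(x16).to(torch.float32)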
If you are looking for something which recurses into a model network and selectively converts certain nn.Module
instances, we do not have that ATM.
I'm trying to convert the whole model to bfloat16 and then convert some layers back to fp32.
import torch
from torch import nn

def bhalf(module):
    "Cast every floating-point tensor in `module` to bfloat16."
    return module._apply(lambda t: t.to(torch.bfloat16) if t.is_floating_point() else t)

def bn2float(module: nn.Module) -> nn.Module:
    "If `module` is a BatchNorm/LayerNorm layer, don't use half precision."
    if isinstance(module, (torch.nn.modules.batchnorm._BatchNorm, torch.nn.LayerNorm)):
        module.float()
    for child in module.children():
        bn2float(child)
    return module

def model2half(model: nn.Module) -> nn.Module:
    "Convert `model` to half precision except the batchnorm layers."
    return bn2float(bhalf(model))
Something like this is done in the fast.ai library for mixed precision CUDA training.
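Roughly how these helpers get applied (toy model shown for illustration; the real model is much larger):
# Toy usage sketch of the helpers above; the conversion happens on CPU
# before the model is moved to the TPU.
import torch_xla.core.xla_model as xm

device = xm.xla_device()
toy = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
toy = model2half(toy)   # bfloat16 everywhere except the norm layers
toy = toy.to(device)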
I'm getting RuntimeError: torch_xla/csrc/tensor_util.cpp:652 : Type not supported: BFloat16
during the conversion. I'm doing the conversion on CPU before moving to TPU, because the model is too big to fit on TPU.
Am I doing it wrong?
Actually, I've tried on a smaller model and the error is the same if I do model2half
after moving the model to TPU.
Are you using the latest nightly wheels/docker? Mapping of PyTorch BFloat16 to XLA BF16 was added a couple of weeks ago.
Thank you, I've updated the image and now I've got
RuntimeError: _th_equal not supported on CPUType for BFloat16
Do you have a stack trace?
I think the issue there is that there is one pytorch op we do not lower to XLA, so we do our usual dance of taking data from the device and running the op using pytorch CPU support ... but pytorch CPU does not support that op for BFloat16.
Do you have any aten::*
in your metrics report when running with plain float32?
Traceback (most recent call last):
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "/root/ru_transformers/tpu_lm_finetuning.py", line 692, in main
train(args, model, tokenizer)
File "/root/ru_transformers/tpu_lm_finetuning.py", line 413, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 533, in forward
head_mask=head_mask)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 427, in forward
hidden_states = self.drop(hidden_states)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 54, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 807, in dropout
else _VF.dropout(input, p, training))
RuntimeError: _th_equal not supported on CPUType for BFloat16
I didn't create a metrics report, should I create one?
Hmm, we support dropout. Can you run in float32 and print like below for a couple of steps?
import torch_xla.debug.metrics as met
print(met.metrics_report())
Oh, I see, it's us generating the call. Let me fix that ...
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm
d = xm.xla_device()
do = nn.Dropout(0.5).to(d)
x = torch.randn(3, 3).to(torch.bfloat16).to(d)
y = do(x)
#0 0x00007fffe89daafd in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00007fffdf5592d5 in at::native::legacy::cpu::_th_equal (self=..., other=...) at aten/src/ATen/LegacyTHFunctionsCPU.cpp:2440
#2 0x00007fffdf49df1b in at::CPUType::(anonymous namespace)::equal (self=..., other=...) at aten/src/ATen/CPUType.cpp:3167
#3 0x00007fffdf50ef43 in c10::detail::WrapRuntimeKernelFunctor_<bool (*)(at::Tensor const&, at::Tensor const&), bool, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >::operator() (this=0x55555614c720, args=...,
args=...) at ../aten/src/ATen/core/boxing/kernel_lambda.h:23
#4 0x00007fffdf50ed4d in c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<bool (*)(at::Tensor const&, at::Tensor const&), bool, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, bool (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) (functor=0x55555614c720, args=..., args=...) at ../aten/src/ATen/core/boxing/kernel_functor.h:260
#5 0x00007fffdba752c9 in c10::KernelFunction::callUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x5555568c9e70, args=..., args=...)
at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/boxing/KernelFunction.h:129
#6 0x00007fffdba7522c in c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}::operator()(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&) const (this=0x7fffffff9210, backendFallbackKernels=...)
at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:189
#7 0x00007fffdba7513e in c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}>(c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}&&) const (
this=0x7fffe8900608 <c10::Dispatcher::singleton()::_singleton+144>, readFunc=...) at /usr/local/google/home/dlibenzi/google-git/pytorch/c10/util/LeftRight.h:74
#8 0x00007fffdba74e9e in c10::Dispatcher::doCallUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x7fffe8900578 <c10::Dispatcher::singleton()::_singleton>, dispatchTable=..., backendFallbackKernels_=..., args=..., args=...)
at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:186
#9 0x00007fffdba74e3a in c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}::operator()(c10::DispatchTable const&) const (this=0x7fffffff93d8, dispatchTable=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:179
#10 0x00007fffdba74d7e in c10::LeftRight<c10::DispatchTable>::read<c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const (this=0x55555614da78,
readFunc=...) at /usr/local/google/home/dlibenzi/google-git/pytorch/c10/util/LeftRight.h:74
#11 0x00007fffdba74ae1 in c10::impl::OperatorEntry::readDispatchTable<c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::{lambda(c10::DispatchTable const&)#1}&&) const (this=0x55555614da00,
functor=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/OperatorEntry.h:32
#12 0x00007fffdba74a9a in c10::Dispatcher::callUnboxed<bool, at::Tensor const&, at::Tensor const&> (this=0x7fffe8900578 <c10::Dispatcher::singleton()::_singleton>, op=..., args=..., args=...)
at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:176
#13 0x00007fffdba749a0 in at::Tensor::equal (this=0x7fffffff98e8, other=...) at /usr/local/google/home/dlibenzi/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:6198
#14 0x00007fffdba64e4c in torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorComparer::operator() (this=0x555557ceed69, tensor1=..., tensor2=...) at torch_xla/csrc/tensor.cpp:162
#15 0x00007fffdba64df5 in xla::util::Cache<at::Tensor, xla::ComputationClient::Data, torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorHasher, torch_xla::(anonymous namespace)::XlaDataCacheArena::TensorComparer>::Equaler::operator() (this=0x555557ceed69, k1=0x7fffffff98e8, k2=0x555557d096f0) at third_party/tensorflow/bazel-tensorflow/tensorflow/compiler/xla/xla_client/cache.h:88
With this the above should work:
https://gist.github.com/dlibenzi/4b449de70d95de7460f136c023c6092f
Need to upstream that, or a version of it.
I'm not sure what to do with that code.
@mruberry is working on pushing it upstream to pytorch. Once that lands, you should not be getting that error anymore.
Any updates?
Sorry for the delay, we will push that soon.
That PR went in. It should be in our nightly. If not today, for sure tomorrow.
Just tried with the updated daily build, same error in the same place.
I have tried the snippet below, and it is working fine for me:
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm
d = xm.xla_device()
do = nn.Dropout(0.5).to(d)
x = torch.randn(3, 3).to(torch.bfloat16).to(d)
y = do(x)
While before it was giving that error. Did you update the dockers/wheels?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
So do we now have the automatic layer dtype conversion like "amp" does? Also, we can use gradient accumulation with TPUs, right? Thanks!
We do not have an AMP-like solution ATM. There were talks about having something like that, but they did not move further given limited resources and higher priorities.
Gradient accumulation does not have anything TPU specific and should work.
@AdityaSoni19031997 When doing gradient accumulation just make sure you call xm.mark_step()
after every forward pass, since without it you'll have a massive graph with N forwards + 1 backward.
Thanks for the heads-up! Was just wondering, do we have an example of this in the colab dir, or should I try to get it working by tomorrow and send a PR? Thanks!
We don't have a colab example but we do have an example of gradient accumulation at https://github.com/pytorch-tpu/fairseq/blob/tpu/fairseq/trainer.py#L396-L402. To clarify the above comment I made, you only need to explicitly call xm.mark_step()
if your parallel_loader returns a list of N batches on which to run gradient accumulation. If your parallel_loader returns one batch at a time and you apply gradient accumulation on N batches returned by it, you don't need to explicitly call mark_step, since that's done for you by the loader.
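A minimal sketch of manual gradient accumulation with an explicit xm.mark_step() between micro-batches (hypothetical loop with a toy model and toy data; the fairseq link above shows the real thing):
# Hedged sketch: accumulate gradients over `accum_steps` micro-batches,
# cutting the XLA graph after every forward/backward with mark_step().
import torch
from torch import nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]  # toy data
accum_steps = 4

for step, (data, target) in enumerate(loader):
    data, target = data.to(device), target.to(device)
    loss = nn.functional.mse_loss(model(data), target) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        xm.optimizer_step(optimizer, barrier=True)  # reduce grads, step, cut the graph
        optimizer.zero_grad()
    else:
        xm.mark_step()  # cut the graph after each accumulation forward/backward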
Any plans to introduce mixed precision training as an analog to opt_level O2 in Nvidia/Apex? I'm training a GPT-2 model right now, and it's not training well with XLA_USE_BF16=1. I get perplexity 120 with BF16 and perplexity 65 with full precision (355M model). If I train on GPU, I see no difference in perplexity between mixed precision mode and full precision.
One more problem with full precision - the big model (774M) can't be trained in full precision at all because there is not enough memory on a TPU v3-8. I can fit it in BF16 mode, but it's not training well, as I said earlier.