microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] When training a FP16 model, the ability to set some of the layers to FP32 #2100

Open BlinkDL opened 2 years ago

BlinkDL commented 2 years ago

When training a FP16 model, I wonder if it's possible to set some of the layers to FP32.

I can add .to(torch.float16) and .to(torch.float32) to do the conversion between layers.

So the process will be like:

```python
...
x = f(x)    # x and f(x) are float16
x = x.to(torch.float32)
x = g(x)    # x and g(x) are float32
x = x.to(torch.float16)
x = f(x)    # x and f(x) are float16
...
```
tjruwase commented 2 years ago

@BlinkDL, thanks for your question. If I understand correctly, it seems there are two parts to this.

First, when you say f(x) is float16 and g(x) is float32, I believe you mean that the weights and biases of f are float16 and the weights and biases of g are float32. One way to achieve this is to construct the layers corresponding to f and g with float16 and float32 respectively during model construction, which occurs outside DeepSpeed. Alternatively, one could take a constructed model and convert the weights and biases of specific layers to the desired dtypes. I think you can prototype this behavior with simple models for testing.
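For illustration, a rough sketch of both options with plain nn.Linear layers (the layer names and sizes here are arbitrary, just for prototyping, and not taken from your model):

```python
import torch
import torch.nn as nn

class MixedPrecisionNet(nn.Module):
    """Option 1: build each layer directly in the dtype you want."""
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(64, 64).half()   # f: weights and bias in float16
        self.g = nn.Linear(64, 64)          # g: stays in the default float32

    def forward(self, x):
        x = self.f(x)                        # x and f(x) are float16
        x = self.g(x.to(torch.float32))      # cast up, then run the float32 layer
        return x.to(torch.float16)           # cast back down for later fp16 layers

model = MixedPrecisionNet()

# Option 2: take an already-constructed model and convert specific layers in place.
model2 = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
model2[0].to(torch.float16)   # layer 0 becomes float16
# model2[1] keeps its float32 weights and bias
```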

Second, you will need automatic or manual type conversions in both the forward and backward passes for adjacent layers of different dtypes. Your code snippet is an example of such manual type conversions for the forward pass. It is not clear to me whether autograd would automatically generate the corresponding type casts for the backward pass. amp already provides some of these features, and it might be worth reading this and this.
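As a quick sanity check of the backward-pass question, a toy example like the sketch below (my own, not from your code) suggests that `.to(dtype)` is itself a differentiable op, so autograd casts the incoming gradient back to each leaf's dtype; whether that is numerically adequate for your model is a separate question:

```python
import torch

# fp16 activation feeding an fp32 weight, mirroring the f -> g boundary
x = torch.randn(4, 4, dtype=torch.float16, requires_grad=True)
w32 = torch.randn(4, 4, dtype=torch.float32, requires_grad=True)

y = (x.to(torch.float32) @ w32).to(torch.float16)  # fp16 -> fp32 -> fp16
y.sum().backward()

print(x.grad.dtype)    # torch.float16 -- gradient cast back to the leaf's dtype
print(w32.grad.dtype)  # torch.float32
```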

After these two issues are addressed, we would need to make some changes in DeepSpeed, especially the ZeRO optimizers, for full support.

Hope that helps.

BlinkDL commented 2 years ago

I can do the first and second manually.

So now it's about DeepSpeed support :)

tjruwase commented 2 years ago

@BlinkDL, it is great to hear you have (1) and (2) working. For us to understand the required DeepSpeed support, can you share an example that already incorporates (1) and (2)? I expect it will fail in DeepSpeed. The stack trace would be a good hint of where to start the DeepSpeed support. Thanks!
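For reference, the kind of repro I have in mind would look roughly like the sketch below; the model, config values, and ZeRO stage are placeholders (this assumes a recent DeepSpeed where `deepspeed.initialize` accepts a config dict), so please substitute your actual setup:

```python
import torch
import torch.nn as nn
import deepspeed

class MixedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Linear(64, 64).half()   # fp16 layer
        self.g = nn.Linear(64, 64)          # fp32 layer

    def forward(self, x):
        x = self.f(x)
        x = self.g(x.to(torch.float32))
        return x.to(torch.float16)

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = MixedNet()
# launch with the deepspeed runner, e.g.: deepspeed repro.py
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
```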