pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Using Pytorch Geometric with Half-Precision via Nvidia AMP #1400

Closed sbonner0 closed 2 years ago

sbonner0 commented 4 years ago

Hey,

I've been trying to use PyTorch Geometric with NVIDIA AMP and have been running into some trouble - is it officially supported?

For example, the following code does not run successfully with any opt level other than O1:

import time

import torch
import torch.nn as nn
import torch.nn.functional as F
from apex import amp
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='/tmp/Cora', name='CiteSeer')

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 32)
        self.conv2 = GCNConv(32, dataset.num_classes)

    def forward(self, data):
        data_x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(data_x, edge_index))
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)

# Initialize Apex AMP at the chosen optimisation level
opt_level = 'O0'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.train()

for epoch in range(500):

    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    optimizer.step()

For example, O0, O2 and O3 produce the following error:

IndexError: tensors used as indices must be long, byte or bool tensors

Is there any way to fix this?

rusty1s commented 4 years ago

It's not officially supported. Our custom CUDA kernels currently do not run with half-precision, but we have plans to support that, see here. Your error is not related to half-precision though. Does AMP index with torch.int32 tensors?
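
For reference, the IndexError itself is easy to reproduce outside of AMP: indexing with a floating-point tensor fails, and casting the index tensor back to long makes it valid again. A minimal sketch (not code from the thread), using the dtype an over-eager cast could leave behind:

import torch

x = torch.randn(4, 3)
idx = torch.tensor([0, 2], dtype=torch.float16)  # index tensor accidentally cast to half

try:
    x[idx]  # raises the IndexError quoted above
except IndexError as e:
    print(e)

print(x[idx.long()])  # casting back to an integer dtype restores valid indexing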

sbonner0 commented 4 years ago

Thanks for your reply! It seems AMP was converting the indices into half-precision floats as well, as you suggested. I was able to get around it by explicitly casting the indices to long. The following code at least runs with all optimisation levels:

import torch
import torch.nn as nn
import torch.nn.functional as F
from apex import amp
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='/tmp/Cora', name='CiteSeer')

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 32)
        self.conv2 = GCNConv(32, dataset.num_classes)

    def forward(self, data):
        data_x, edge_index = data.x, data.edge_index.long()
        x = F.relu(self.conv1(data_x, edge_index))
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)

# Initialization
opt_level = 'O2'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.train()

for epoch in range(500):

    optimizer.zero_grad()
    out = model(data)

    # Keep the mask boolean: casting it to long would turn boolean masking into
    # index selection over nodes 0 and 1 and silently train on the wrong rows
    train_mask = data.train_mask.bool()
    loss = F.nll_loss(out[train_mask], data.y[train_mask].long())

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

Should I trust that it is being trained correctly, though, if half precision is not explicitly supported? Thanks for the link to the pull request - if you need any help testing it, please do let me know!

murnanedaniel commented 4 years ago

@sbonner0 I'm dealing with this exact situation at the moment. I'm not sure, but you may also run into a situation where the index values themselves change when the edge_index tensor is passed into the model, because AMP converts everything in the data object to float16. Have you noticed this? You can solve it by passing each tensor into the model separately. Let me know if you hit this issue.

Also, I'm very keen to know whether you are seeing any memory/speed improvements. I'm seeing zero boost from AMP and am struggling to find the source of this. Possibly it's the custom PyTorch Geometric CUDA kernels.
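
As a side note on why the float16 conversion is so destructive (a minimal sketch, not code from the thread): float16 has an 11-bit significand, so node indices above 2048 can no longer all be represented exactly and get rounded to a nearby value, silently rewiring the graph.

import torch

edge_ids = torch.tensor([2047, 2049, 4099])
print(edge_ids.half())         # tensor([2047., 2048., 4100.], dtype=torch.float16)
print(edge_ids.half().long())  # tensor([2047, 2048, 4100]) - two of the three indices are now wrong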

sbonner0 commented 4 years ago

Hi @murnanedaniel, thanks for your response! Are you saying that the index values themselves are being changed by the conversion to fp16? I shall investigate whether this is happening for me as well and let you know.

I am also not seeing any real, noticeable improvement in speed or memory usage when using AMP with Geometric. As you said, it could well be due to the custom Geometric kernels not being half-precision compatible at the moment - it seems that will change soon with the upcoming PR.

murnanedaniel commented 4 years ago

I can't be sure until you test your situation, but for me everything in the data object was being converted to half precision, even int-type tensors. This led to indices being rounded up or down (according to these rules), i.e. the graph got completely scrambled. The solution (given a data object with a long-type edge_index and a float-type x) was to pass each tensor in individually:

out = model(data.x, data.edge_index.int())

Then convert edge_index back to long within forward(). AMP then understood which tensors to convert and which to leave alone (since it doesn't touch int tensors by design). If you find a better way to handle this, let me know!
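
Put together, the pattern described here looks roughly like this (a sketch based on the two-layer GCN from the earlier snippets; the class and parameter names are illustrative):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # edge_index arrives as an int tensor, which AMP leaves untouched;
        # cast it back to long, the dtype the message passing expects
        edge_index = edge_index.long()
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# call with the tensors passed in separately:
# out = model(data.x, data.edge_index.int())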

murnanedaniel commented 4 years ago

And just a follow-up on memory usage - I'm now seeing a 50% drop in peak GPU memory usage with the O2 level of AMP, which is significant. But no speed improvement.
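
For anyone wanting to check a figure like this themselves, peak usage can be read from PyTorch's CUDA memory statistics around a training run (a minimal sketch, not code from the thread):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training loop here ...
peak_mib = torch.cuda.max_memory_allocated() / 1024 ** 2
print(f'peak GPU memory allocated: {peak_mib:.1f} MiB')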

sbonner0 commented 4 years ago

Hi @murnanedaniel, great news on the memory usage - did you have to do anything extra in the code to get that to work? How is the accuracy at the O2 level?

murnanedaniel commented 4 years ago

Once the data types were handled correctly, I didn't have to do anything further. I hadn't seen the boost earlier because the model was quite small compared to the data, but increasing the number of layers and hidden features made the memory benefits clearer. Accuracy was basically unaffected at O2, apart from occasional gradient overflows, but turning off the master weights or going to O3 led to severe overflow. This may be because the GNN is quite deep? I'm still looking into that issue. It may be unsolvable.
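
For context, the settings being discussed map onto amp.initialize overrides roughly as follows (a sketch that assumes the model and optimizer from the earlier snippets; not code from the thread):

from apex import amp

# O2 keeps FP16 model weights plus an FP32 master copy for the optimizer step,
# with dynamic loss scaling that backs off when gradients overflow.
model, optimizer = amp.initialize(
    model, optimizer,
    opt_level='O2',
    master_weights=True,   # disabling the master copy is what led to the severe overflow above
    loss_scale='dynamic',  # the default; a fixed float can be passed instead
)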

sbonner0 commented 2 years ago

It seems that the PR here has fixed this issue.
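
For completeness, with half-precision support in the kernels, the same training loop can also be written with PyTorch's built-in torch.cuda.amp instead of apex (a sketch that reuses the model, data and optimizer from the earlier snippets; not code from the thread):

import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()

for epoch in range(500):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        out = model(data)
        loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients and skips the step on overflow
    scaler.update()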