pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

pytorch dataparallel with QM9 #1498

Open HanChen-HUST opened 4 years ago

HanChen-HUST commented 4 years ago

Hello, I have two GPUs and I want to train with both of them using torch.nn.DataParallel. I changed the file here:

os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'
device = torch.device('cuda')
model = Net()
model = torch.nn.DataParallel(model)
model.to(device)

but it raises this error: RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/THC/generic/THCTensorMathBlas.cu:270

How can I fix it? Thanks for your reply!

rusty1s commented 4 years ago

You need to use torch_geometric.nn.DataParallel. You can find an example here.
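For reference, a minimal sketch of the wrapping step (assuming a Net model as in the QM9 example):

from torch_geometric.nn import DataParallel

model = DataParallel(Net())  # PyG's DataParallel, not torch.nn.DataParallel
model = model.to(torch.device('cuda:0'))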

HanChen-HUST commented 4 years ago

Thank you, I changed it for QM9, but now it raises another error: AttributeError: 'tuple' object has no attribute 'num_nodes'

rusty1s commented 4 years ago

Can you show me a minimal example to reproduce this?

HanChen-HUST commented 4 years ago

import os
import os.path as osp

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn import Sequential, Linear, ReLU, GRU

import torch_geometric.transforms as T
from torch_geometric.data import DataLoader
from torch_geometric.nn import NNConv, Set2Set, DataParallel
from torch_geometric.utils import remove_self_loops

from modelmof import mof

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

dim = 64
target = 0  # target property index (undefined in the original snippet; 0 assumed)


class MyTransform(object):
    def __call__(self, data):
        # Specify target.
        data.y = data.y[:, target]
        return data


class Complete(object):
    def __call__(self, data):
        device = data.edge_index.device

        row = torch.arange(data.num_nodes, dtype=torch.long, device=device)
        col = torch.arange(data.num_nodes, dtype=torch.long, device=device)

        row = row.view(-1, 1).repeat(1, data.num_nodes).view(-1)
        col = col.repeat(data.num_nodes)
        edge_index = torch.stack([row, col], dim=0)

        edge_attr = None
        if data.edge_attr is not None:
            idx = data.edge_index[0] * data.num_nodes + data.edge_index[1]
            size = list(data.edge_attr.size())
            size[0] = data.num_nodes * data.num_nodes
            edge_attr = data.edge_attr.new_zeros(size)
            edge_attr[idx] = data.edge_attr

        edge_index, edge_attr = remove_self_loops(edge_index, edge_attr)
        data.edge_attr = edge_attr
        data.edge_index = edge_index
        return data


path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'mydataset.pt')
transform = T.Compose([Complete(), T.Distance(norm=False)])
dataset = mof(path, transform=transform).shuffle()
train_dataset = dataset[2000:]
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lin0 = torch.nn.Linear(100, dim)

        nn = Sequential(Linear(1, 128), ReLU(), Linear(128, dim * dim))
        self.conv = NNConv(dim, dim, nn, aggr='mean')
        self.gru = GRU(dim, dim)

        self.set2set = Set2Set(dim, processing_steps=3)
        self.lin1 = torch.nn.Linear(2 * dim, dim)
        self.lin2 = torch.nn.Linear(dim, 1)

    def forward(self, data):
        out = F.relu(self.lin0(data.x))
        h = out.unsqueeze(0)

        for i in range(3):
            m = F.relu(self.conv(out, data.edge_index, data.edge_attr))
            out, h = self.gru(m.unsqueeze(0), h)
            out = out.squeeze(0)

        out = self.set2set(out, data.batch)
        out = F.relu(self.lin1(out))
        out = self.lin2(out)
        return out.view(-1)


model = Net()
print("Let's use", torch.cuda.device_count(), 'GPUs!')
model = DataParallel(model)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for data_list in train_loader:
    optimizer.zero_grad()
    output = model(data_list)
    y = torch.cat([data.y for data in data_list]).to(output.device)
    loss = F.nll_loss(output, y)
    loss.backward()
    optimizer.step()

rusty1s commented 4 years ago

Note that you need to use DataListLoader for loading your data when using DataParallel.
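In other words, a sketch of the two changes against the script above (DataListLoader keeps the examples as a plain Python list instead of collating them into one Batch, so DataParallel can scatter them across GPUs):

from torch_geometric.data import DataListLoader
from torch_geometric.nn import DataParallel

# Returns a list of Data objects per step instead of a collated Batch:
train_loader = DataListLoader(train_dataset, batch_size=128, shuffle=True)
model = DataParallel(Net()).to(torch.device('cuda:0'))

for data_list in train_loader:
    output = model(data_list)  # DataParallel batches per GPU internally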

HanChen-HUST commented 4 years ago

Could you give me an example? I don't know what you mean. Sorry for the bother, and thanks!

rusty1s commented 4 years ago

See https://github.com/rusty1s/pytorch_geometric/blob/master/examples/data_parallel.py#L6

HanChen-HUST commented 4 years ago

Thanks, it works! Another question: can I add the bond distance value to the PyG edge_attr? I see it is in one-hot format, so how can I add it?

rusty1s commented 4 years ago

Can you clarify? torch_geometric.transforms.Distance should automatically take care of that.
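For example, a sketch with the built-in QM9 dataset ('data/QM9' is just an arbitrary storage root):

import torch_geometric.transforms as T
from torch_geometric.datasets import QM9

# T.Distance reads data.pos and appends the Euclidean distance between
# the endpoints of every edge as an extra column of data.edge_attr:
dataset = QM9('data/QM9', transform=T.Distance(norm=False))
print(dataset[0].edge_attr.size())  # last column holds the distances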

HanChen-HUST commented 4 years ago

I thought edge_attr is the place to add the distance between the atoms, so I added it there, but QM9 uses a one-hot vector to represent it. If I want to use the true distance in PyG, how can I add it? Also, QM9_NN_CONV.py didn't use the pos information; I want to use it too. How can I do that? Finally, the model doesn't perform well on graphs with a large number of atoms. What could be the reason? Thanks!

HanChen-HUST commented 4 years ago

The task is graph regression. Which model should I use?

rusty1s commented 4 years ago

The QM9 example does use the pos information: it calculates the distances between source and target nodes and adds them to edge_attr, e.g. like this (this is what T.Distance implements):

row, col = edge_index
dist = (pos[row] - pos[col]).norm(dim=-1)
edge_attr = torch.cat([edge_attr, dist.unsqueeze(-1)], dim=-1)

Regarding regression and classification, their major difference is the loss formulation. The model itself isn't much affected by that, so you can basically use any classification model for regression as well.
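A sketch of what swapping the loss looks like (assuming predictions out of shape [num_graphs] for regression and [num_graphs, num_classes] for classification, with targets y):

import torch.nn.functional as F

# Regression: continuous targets.
loss = F.mse_loss(out, y)  # or F.l1_loss for mean absolute error

# Classification: integer class labels.
loss = F.cross_entropy(out, y)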

HanChen-HUST commented 4 years ago

Thanks, but one atom may have a different number of bonds, so how can I add the distance into edge_attr?

here https://github.com/rusty1s/pytorch_geometric/blob/271146a1c82aa077442206002911dbbda7053d7c/torch_geometric/datasets/qm9.py#L243

Must edge_attr be encoded as one-hot? If so, how can I express the distance as one-hot?

rusty1s commented 4 years ago

It doesn't have to be, but the one-hot encoding is more convenient to use if you want to add continuous edge features to the graph.
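For example, a sketch with hypothetical tensors for two edges: the one-hot bond type and the continuous distance simply sit side by side in edge_attr:

import torch

bond_type = torch.tensor([[1., 0., 0., 0.],    # one-hot: single bond
                          [0., 1., 0., 0.]])   # one-hot: double bond
dist = torch.tensor([[1.09],                   # continuous bond lengths
                     [1.33]])

edge_attr = torch.cat([bond_type, dist], dim=-1)  # shape [2, 5]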

Frank-LIU-520 commented 4 years ago

> It doesn't have to be, but the one-hot encoding is more convenient to use if you want to add continuous edge features to the graph.

When I use the DataParallel layer in MPNN with 2 GPUs, the volatile GPU utilization becomes very low, at about 44%. What is the problem? In this situation, should I just use 1 GPU without DataParallel to train the model more efficiently?

rusty1s commented 4 years ago

I think that heavily depends on the batch_size of your training process. For molecules, using small batch sizes is generally a good idea for better training, but it comes at the cost of low GPU utilization. The GPU should easily be able to fit batch sizes of 512 or 1024.
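So something like the following is worth trying (a sketch; the right value depends on your GPU memory):

from torch_geometric.data import DataListLoader

# Larger batches keep both GPUs busy; DataParallel splits the list evenly:
train_loader = DataListLoader(train_dataset, batch_size=512, shuffle=True)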

yanty123 commented 4 years ago

> See https://github.com/rusty1s/pytorch_geometric/blob/master/examples/data_parallel.py#L6

Could you please give me the new link? The old one is broken. I also need to see how to use it this way.

rusty1s commented 4 years ago

All multi GPU examples have been moved to https://github.com/rusty1s/pytorch_geometric/tree/master/examples/multi_gpu (including newly introduced distributed training examples).

yanty123 commented 4 years ago

> All multi GPU examples have been moved to https://github.com/rusty1s/pytorch_geometric/tree/master/examples/multi_gpu (including newly introduced distributed training examples).

OK, thank you very much!

adham-synbio commented 3 years ago

Hello, I'm using multiple GPUs with the Davis dataset for DTI prediction; my task is a regression task. Although my code uses DataParallel() and DataListLoader(), I'm getting the following error: AttributeError: 'tuple' object has no attribute 'num_nodes'

PS: the code works fine on a single GPU.

rusty1s commented 3 years ago

Can you show me a minimal example? My guess is that your dataset returns a tuple rather than a single data object.
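For reference, a sketch of that failure mode with a hypothetical dataset: DataParallel scatters a list of Data objects, so a get that returns a tuple produces exactly this AttributeError:

import torch
from torch_geometric.data import Data, Dataset

class PairDataset(Dataset):  # hypothetical minimal example
    def len(self):
        return 10

    def get(self, idx):
        data = Data(x=torch.randn(3, 16),
                    edge_index=torch.tensor([[0, 1], [1, 2]]))
        return data, idx  # BAD: a tuple has no .num_nodes
        # return data     # GOOD: return the Data object itself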

adham-synbio commented 3 years ago

trainer.txt

I've attached part of my script, since the whole script is too big. Thanks for your comment!

rusty1s commented 3 years ago

I don't think you need to convert the data_list returned by self.train_dataloader to a Batch object manually. This is handled internally by torch_geometric.nn.DataParallel, see here.
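That is, a sketch of the loop (names taken from your script): pass the raw list straight to the model:

for data_list in self.train_dataloader:  # a plain Python list of Data objects
    output = model(data_list)            # no manual Batch.from_data_list needed
    y = torch.cat([data.y for data in data_list]).to(output.device)
    loss = F.mse_loss(output, y)         # regression loss for DTI prediction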