pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

How to make ``scatter`` (Just CUDA) results repeatable. #2788

Open boliqq07 opened 3 years ago

boliqq07 commented 3 years ago

❓ Questions & Help

I am trying to reproduce my results, but I find that scatter from torch_scatter is non-deterministic on CUDA, even with a fixed random seed.

Since scatter is used inside the MessagePassing class, I thought this was worth paying attention to.

Or did I make a mistake or overlook something?

The following is my test script and its results.

I'd appreciate any help or ideas.

file.md

import os
import random

import numpy as np
import torch
from torch import nn
from torch.backends import cudnn
from torch.nn import Linear, Softplus
from torch_scatter import scatter

class ReadOutLayer(nn.Module):
    """Merge node layer."""

    def __init__(self, num_filters, out_size=1, readout="add", temp_to_cpu=True):
        super(ReadOutLayer, self).__init__()
        self.readout = readout
        self.lin1 = Linear(num_filters, num_filters * 5)
        self.s1 = Softplus()
        self.lin2 = Linear(num_filters * 5, num_filters)
        self.s2 = Softplus()
        self.lin3 = Linear(num_filters, out_size)
        self.temp_to_cpu = temp_to_cpu

    def forward(self, h, batch):
        h = self.lin1(h)
        h = self.s1(h)
        h = self.jump(h, batch)
        h = self.lin2(h)
        h = self.s2(h)
        h = self.lin3(h)
        return h

    def jump(self, h, batch):
        if self.temp_to_cpu:
            # torch_scatter's scatter can be non-deterministic on CUDA, especially for
            # small inputs, so optionally run the aggregation on the CPU instead.
            old_device = h.device
            device = torch.device("cpu")
            h = h.to(device=device)
            batch = batch.to(device=device)
            h = scatter(h, batch, dim=0, reduce=self.readout)
            h = h.to(device=old_device)
        else:
            h = scatter(h, batch, dim=0, reduce=self.readout)

        return h

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # torch.backends.cudnn.enabled = False
    cudnn.deterministic = True
    cudnn.benchmark = False

set_seed(1)

##### get data.x, data.y, data.batch
x = torch.rand((1000, 100), requires_grad=True)
y = torch.rand((100, 1), requires_grad=True)
batch_mark = torch.randint(low=0, high=1000, size=(100,))
batch_mark = torch.sort(batch_mark).values
batch = torch.zeros((1000,))
for n, i in enumerate(batch_mark):
    batch[i:] = n

batch = batch.to(torch.int64)

# training routine: returns the final loss for a given device / scatter setting

def scatter_check(x, y, batch, test):

    if test == "just cpu":
        temp_to_cpu = False
        device = torch.device("cpu")

    elif test == "cuda with cpu scatter":
        temp_to_cpu = True
        device = torch.device("cuda:0")

    elif test == "cuda":
        temp_to_cpu = False
        device = torch.device("cuda:0")
    else:
        raise NotImplementedError

    model = ReadOutLayer(100, temp_to_cpu=temp_to_cpu)
    x = x.to(device)
    y = y.to(device)
    batch = batch.to(device)
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
    loss_method = torch.nn.MSELoss()

    loss_ir = 0

    for i in range(300):
        p_y = model(x, batch)

        lossi = loss_method(p_y, y)
        print(i, lossi.item())
        loss_ir = lossi.item()

        lossi.backward()
        optimizer.step()
        optimizer.zero_grad()
    return loss_ir

set_seed(1)
a = scatter_check(x, y, batch, test="just cpu")
set_seed(1)
b = scatter_check(x, y, batch, test="just cpu")
# CPU is OK.
assert a == b

set_seed(1)
a = scatter_check(x, y, batch, test="cuda with cpu scatter")
set_seed(1)
b = scatter_check(x, y, batch, test="cuda with cpu scatter")
# CUDA, but jumping to the CPU to run ``scatter``, is also OK.
assert a == b

set_seed(1)
a = scatter_check(x, y, batch, test="cuda")
set_seed(1)
b = scatter_check(x, y, batch, test="cuda")
# Plain CUDA fails: the two runs give different final losses.
assert a != b
rusty1s commented 3 years ago

Scatter is a non-deterministic operation by design since it makes use of atomic operations in which the order of aggregation is non-deterministic, leading to minor numerical differences. As an alternative, you can make use of the segment_csr operation of torch_scatter, see here.

For message passing layers, deterministic aggregation is only guaranteed when using SparseTensor.

In the end, I wouldn't worry too much about it. In a deep learning scenario, such numerical instabilities should only be noticeable on really small datasets. Although it is correct that exact reproducibility is no longer guaranteed when using non-deterministic operations, we can only enforce reproducibility for a single permutation (which does not exist in the context of graphs).
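
For illustration, replacing the scatter call in the jump method above with segment_csr could look roughly like this (a minimal sketch: it assumes batch is sorted, as in the script, and jump_csr is just an illustrative name):

    import torch
    from torch_scatter import segment_csr

    def jump_csr(h, batch, reduce="sum"):
        # ``batch`` must be sorted; turn the per-graph segment sizes into a CSR pointer.
        counts = torch.bincount(batch)
        indptr = torch.cat([counts.new_zeros(1), counts.cumsum(0)])
        # ``segment_csr`` aggregates contiguous segments and is deterministic on CUDA
        # (note it takes reduce="sum" rather than "add").
        return segment_csr(h, indptr, reduce=reduce)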

boliqq07 commented 3 years ago

> Scatter is a non-deterministic operation by design since it makes use of atomic operations in which the order of aggregation is non-deterministic, leading to minor numerical differences. As an alternative, you can make use of the segment_csr operation of torch_scatter, see here.
>
> For message passing layers, deterministic aggregation is only guaranteed when using SparseTensor.
>
> In the end, I wouldn't worry too much about it. In a deep learning scenario, such numerical instabilities should only be noticeable on really small datasets. Although it is correct that exact reproducibility is no longer guaranteed when using non-deterministic operations, we can only enforce reproducibility for a single permutation (which does not exist in the context of graphs).


    def __lift__(self, src, edge_index, dim):
        if isinstance(edge_index, Tensor):
            index = edge_index[dim]
            return src.index_select(self.node_dim, index)
        elif isinstance(edge_index, SparseTensor):
            if dim == 1:
                rowptr = edge_index.storage.rowptr()
                rowptr = expand_left(rowptr, dim=self.node_dim, dims=src.dim())
                return gather_csr(src, rowptr)
            elif dim == 0:
                col = edge_index.storage.col()
                return src.index_select(self.node_dim, col)
        raise ValueError

(This code is in MessagePassing.)

I tried using SparseTensor, segment_csr and gather_csr, and that part is deterministic. But for a full network, the reproducibility problem still exists.

Finally, I traced the problem to torch's index_select function: index_select can be a non-deterministic operation too.

So I used plain indexing ([]) instead of index_select, and it works, though with reduced versatility (plain indexing only matches index_select when self.node_dim is 0). Replacing return src.index_select(self.node_dim, col) with return src[col] fixes the problem for me.
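
A tiny sketch of the equivalence (and its limitation) on the forward side:

    import torch

    src = torch.randn(5, 3)
    col = torch.tensor([0, 2, 2, 4])
    # Same values in the forward pass; src[col] always indexes dimension 0,
    # which is why this replacement only covers the node_dim == 0 case.
    assert torch.equal(src.index_select(0, col), src[col])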

rusty1s commented 3 years ago

Running torch.use_deterministic_algorithms(True) should fix that as well, I guess :)
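
A minimal sketch of what enabling that globally looks like (the CUBLAS_WORKSPACE_CONFIG setting is required on recent CUDA versions and must be set before any CUDA work happens):

    import os
    import torch

    # Required on CUDA >= 10.2 for deterministic algorithms to be usable.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Use deterministic kernels where available and raise an error for
    # operations that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)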

shuoyinn commented 2 years ago

@rusty1s Hello, you said scatter can be non-deterministic and thus cause minor numerical differences. But isn't scatter, as you said, intrinsically permutation-invariant? Why would there be any difference if the result does not depend on the order in which the elements are scattered? What I mean is: 1 + 2 + 3 and 2 + 1 + 3 give the same value, and the gradients will be the same too, right? So why does a difference occur?

As far as I know, the CUDA version of scatter uses many parallel threads to implement the operation; is that the reason?

rusty1s commented 2 years ago

Yes, this is due to how floating-point arithmetic works. If the ordering of operations is not deterministic internally, you may get slightly different outputs, e.g., (1 + 2) + 3 may differ from 1 + (2 + 3).
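
A quick way to see this non-associativity with plain Python floats:

    >>> (0.1 + 0.2) + 0.3
    0.6000000000000001
    >>> 0.1 + (0.2 + 0.3)
    0.6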

shuoyinn commented 2 years ago

I understand now, and thank you for such a quick reply. Very helpful.

jiaqian commented 1 year ago

> Scatter is a non-deterministic operation by design since it makes use of atomic operations in which the order of aggregation is non-deterministic, leading to minor numerical differences. As an alternative, you can make use of the segment_csr operation of torch_scatter, see here. For message passing layers, deterministic aggregation is only guaranteed when using SparseTensor. In the end, I wouldn't worry too much about it. In a deep learning scenario, such numerical instabilities should only be noticeable on really small datasets. Although it is correct that exact reproducibility is no longer guaranteed when using non-deterministic operations, we can only enforce reproducibility for a single permutation (which does not exist in the context of graphs).
>
>     def __lift__(self, src, edge_index, dim):
>         if isinstance(edge_index, Tensor):
>             index = edge_index[dim]
>             return src.index_select(self.node_dim, index)
>         elif isinstance(edge_index, SparseTensor):
>             if dim == 1:
>                 rowptr = edge_index.storage.rowptr()
>                 rowptr = expand_left(rowptr, dim=self.node_dim, dims=src.dim())
>                 return gather_csr(src, rowptr)
>             elif dim == 0:
>                 col = edge_index.storage.col()
>                 return src.index_select(self.node_dim, col)
>         raise ValueError
>
> (This code is in MessagePassing.)
>
> I tried using SparseTensor, segment_csr and gather_csr, and that part is deterministic. But for a full network, the reproducibility problem still exists.
>
> Finally, I traced the problem to torch's index_select function: index_select can be a non-deterministic operation too.
>
> So I used plain indexing ([]) instead of index_select, and it works, though with reduced versatility (plain indexing only matches index_select when self.node_dim is 0). Replacing return src.index_select(self.node_dim, col) with return src[col] fixes the problem for me.

Hi, could you please share how to use SparseTensor? I'm struggling with it; for example, some functions like negative_sampling only support a plain edge_index tensor rather than a SparseTensor. Where do you convert the normal tensor to a SparseTensor? Thanks :)

rusty1s commented 1 year ago

Did you take a look at https://pytorch-geometric.readthedocs.io/en/latest/advanced/sparse_tensor.html? You can convert between the two via:

row, col, edge_attr = adj_t.t().coo()
edge_index = torch.stack([row, col], dim=0)
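
and roughly like this in the other direction (a sketch continuing the snippet above; num_nodes stands for your graph's node count):

    from torch_sparse import SparseTensor

    # edge_index: [2, num_edges]; edge_attr is optional and may be None.
    adj_t = SparseTensor(row=edge_index[0], col=edge_index[1], value=edge_attr,
                         sparse_sizes=(num_nodes, num_nodes)).t()
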
jiaqian commented 1 year ago

> Did you take a look at https://pytorch-geometric.readthedocs.io/en/latest/advanced/sparse_tensor.html? You can convert between the two via:
>
>     row, col, edge_attr = adj_t.t().coo()
>     edge_index = torch.stack([row, col], dim=0)

Hi Matthias, thanks for the quick reply. I think in my case the big differences between rounds mostly come from the training-data batch generation. Do you have any suggestions on how to make batch generation (using NeighborLoader, with all the seeds set as suggested above) deterministic?

rusty1s commented 1 year ago

I think NeighborLoader should be deterministic if you set a manual seed. Which pyg-lib/torch-sparse version are you using?

jiaqian commented 1 year ago

> I think NeighborLoader should be deterministic if you set a manual seed. Which pyg-lib/torch-sparse version are you using?

torch_sparse 0.6.13. I tried fixing the seed, but across rounds I get batches with different numbers of edge pairs and even different numbers of root nodes. By the way, my graph is heterogeneous; I'm not sure whether that has an impact.

rusty1s commented 1 year ago

Deterministic neighborhood sampling is available from torch-sparse 0.6.14 onwards, see here.
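
On such versions, a seeded loader setup could look roughly like this (a sketch; data, the fan-out and the batch size are placeholders for your own setting):

    import torch
    from torch_geometric.loader import NeighborLoader

    torch.manual_seed(1)  # seed before constructing and iterating the loader
    loader = NeighborLoader(
        data,                    # your graph object
        num_neighbors=[10, 10],  # fan-out per hop
        # input_nodes=...,       # for heterogeneous graphs, select the seed node type here
        batch_size=128,
        shuffle=True,
        num_workers=0,           # worker processes are another source of randomness
    )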

jiaqian commented 1 year ago

> Deterministic neighborhood sampling is available from torch-sparse 0.6.14 onwards, see here.

I think I found the problem. Just setting device="cpu" is not enough to disable CUDA. I created a new environment with the CPU versions of torch and PyG, and it is reproducible now. Thanks for the help :)

GregoireLamb commented 6 months ago

> Did you take a look at https://pytorch-geometric.readthedocs.io/en/latest/advanced/sparse_tensor.html? You can convert between the two via:
>
>     row, col, edge_attr = adj_t.t().coo()
>     edge_index = torch.stack([row, col], dim=0)

Hello!

I'm facing a similar issue and wonder whether and how to use SparseTensor. What should be transformed into a sparse tensor to get deterministic output?

I use graph = T.Compose([T.ToSparseTensor()])(graph) to get an adjacency matrix of type <class 'torch_sparse.tensor.SparseTensor'>. My graph.x remains of type <class 'torch.Tensor'>; it does not contain zeros, and changing it to a SparseTensor leads to errors (it no longer has a strided layout). I found here that a dense tensor should work.

I don't know if it helps, but a strange thing is that my different runs only start to diverge after some iterations, as seen in the picture (I checked with a precision of 1e-16, and the first 12-15 loss values are exactly the same).

(image: loss curves of the different runs)

rusty1s commented 6 months ago

Yes, graph.x should be a dense tensor.

It's hard for me to say what might have gone wrong here. There might be other sources of non-determinism (e.g., differently sampled mini-batches from step 70 onwards).
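
For reference, the usual pattern is to keep data.x dense and only pass the SparseTensor adjacency to the convolution, roughly like this (a sketch; GCNConv and the channel sizes are placeholders for your own model):

    import torch_geometric.transforms as T
    from torch_geometric.nn import GCNConv

    data = T.ToSparseTensor()(data)            # replaces data.edge_index with data.adj_t
    conv = GCNConv(in_channels, out_channels)  # placeholders for your feature sizes
    out = conv(data.x, data.adj_t)             # dense x, SparseTensor adjacency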