pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Issue reproducing the results of the original ecc implementation: the pooling layer and conv layer give different results from the original implementation #331

Closed dhorka closed 5 years ago

dhorka commented 5 years ago

As I mentioned in #319, I am having problems reproducing the ecc implementation using pytorch_geometric. I found some differences between the results: first, the two convolution operations give different outputs when using the same weights. Moreover, the outputs of the pooling layers also differ.

I created a test that checks these things. Basically, the script loads the same weights into both implementations. These weights were obtained by training a network using the ecc implementation. Below you can see the output of my test.

ECC Weights and PyGeometric weights are equal: True #I am only doing a re-check in order to be sure that both weights are equal before to load to the models.
Loading weights 
Starting validation:
ecc features conv1:  (997, 16) #Shape of the output of first conv in ecc implementation
pygeometric features conv1:  (997, 16) #Shape of the output of first conv in pygeometric implementation
Max difference between features of first conv 2.549824
Output of ecc pooling:  (398, 32)
Output of PyGeometric pooling:  (385, 32)
Pygeomtric Acc:  41.51982378854625  Ecc accuracy:  63.65638766519823
Pygeomtric Loss:  2.435516586519023  Ecc Loss:  0.9878960176974138

As you can see, this difference has an impact on the accuracy when using the same weights. You can find the source code here. One important thing: the data used for these tests is obtained from the original ecc code.

rusty1s commented 5 years ago

Exactly.

rusty1s commented 5 years ago

Ah! Yes, the pooling map in ecc is obtained by a nearest neighbor search on the coarsened positions, not based on voxel affiliation like in PyG.

dhorka commented 5 years ago

Can we add a flag to choose between the two algorithms?

rusty1s commented 5 years ago
from torch_geometric.nn import voxel_grid
from torch_geometric.nn.pool.consecutive import consecutive_cluster
from torch_geometric.utils import scatter_
from torch_cluster import nearest

# Body of the pooling module's forward(self, data):
cluster = voxel_grid(
        data.pos,
        data.batch,
        self.pool_rad,
        start=data.pos.min(dim=0)[0] - self.pool_rad * 0.5,
        end=data.pos.max(dim=0)[0] + self.pool_rad * 0.5)

cluster, perm = consecutive_cluster(cluster)
new_pos = scatter_('mean', data.pos, cluster)
new_batch = data.batch[perm]
cluster = nearest(data.pos, new_pos, data.batch, new_batch)
data.x = scatter_('max', data.x, cluster)
data.pos = new_pos
data.batch = new_batch
return data

This computes new node features like it is done in the ecc implementation.

rusty1s commented 5 years ago

We cannot simply add a flag to max_pool to achieve this. We would need to add a separate implementation of this operator, or simply let users define this implementation on their own.

dhorka commented 5 years ago

Thanks! I will check it!

dhorka commented 5 years ago

These are the results after checking:

ecc features conv1: (993, 16)
pygeometric features conv1: (993, 16)
Max difference between features of conv1 7.1525574e-07
ecc features conv2: (993, 32)
pygeometric features conv2: (993, 32)
Max difference between features of conv2 2.3841858e-07
Output of ecc pooling: (257, 32)
Output of PyGeometric pooling: torch.Size([28, 32])
Sum max_pool ecc 2773.065
Sum max_pool pygeometric 328.0657
Difference between features of max_pool1 2444.9993
Size positions ecc: (257, 3)
Size positions pyg: (28, 3)
Sum positions ecc: 7565.66983967046
Sum positions pyg: 841.835113007279
Difference positions: 6723.834726663181

It seems that it is not achieving the same results, right? The code used for max_pool is this one:

class GraphMaxPooling(torch.nn.Module):
      def __init__(self, pool_rad):
          super(GraphMaxPooling, self).__init__()
          self.pool_rad = pool_rad
          self.graph = T.Compose([T.KNNGraph(9, loop=True), T.Cartesian(norm=False, cat=True)])

      def forward(self, data):
          cluster = voxel_grid(data.pos, data.batch, self.pool_rad,
                              start=data.pos.min(dim=0)[0] - self.pool_rad * 0.5,
                              end=data.pos.max(dim=0)[0] + self.pool_rad * 0.5)

          cluster, perm = consecutive_cluster(cluster)

          new_pos = scatter_('mean', data.pos, cluster)
          new_batch = data.batch[perm]

          cluster = nearest(data.pos, new_pos, data.batch, new_batch)
          data.x = scatter_('max', data.x, cluster)

          data.pos = new_pos
          data.batch = new_batch
          data.edge_attr = None

          data = self.graph(data)
          return data
rusty1s commented 5 years ago
ECC Weights and PyGeometric weights are equal: True
Loading weights
Starting validation:
ecc features conv1:  (997, 16)
pygeometric features conv1:  (997, 16)
Max difference between features of conv1 4.7683716e-07
ecc features conv2:  (997, 32)
pygeometric features conv2:  (997, 32)
Max difference between features of conv2 3.5762787e-07
Output of ecc pooling:  (398, 32)
Output of PyGeometric pooling:  torch.Size([47, 32])
Sum max_pool ecc 4054.4414
Sum max_pool pygeometric 509.8744
Difference between features of max_pool1 3544.567
Size positions ecc:  (398, 3)
Size positions pyg:  (47, 3)
Sum positions ecc:  16834.863076248246
Sum positions pyg:  1998.5325911267726
Difference positions:  14836.330485121474
Pygeomtric Acc:  63.65638766519823  Ecc accuracy:  63.65638766519823
Pygeomtric Loss:  0.9878960048312128  Ecc Loss:  0.9878960125771908
rusty1s commented 5 years ago

I'm not sure if the print statements are buggy, but accuracy and loss are the same for me.

dhorka commented 5 years ago

You are right, the prints are buggy. The hook is taking the value of the last max_pooling; I do not know why. Thank you very much!

I would like to ask if you could explain this to me in more detail:

> Ah! Yes, the pooling map in ecc is obtained by a nearest neighbor search on the coarsened positions, not based on voxel affiliation like in PyG.

I don't see the difference. When you talk about voxel affiliation, you mean all the points that are inside the voxel, right? The new features and new positions are estimated from these points, right? Isn't that the same as saying that you are using the nearest neighbors?

Thanks,

rusty1s commented 5 years ago

Not necessarily, imagine two neighboring voxels with two resulting mean superpoints marked as x (the resulting coarsened points) and an outlier o:

+-----+-----+
|x    |     |
|    o|x    |
+-----+-----+

PyG max_pool pools the o into the left voxel, whereas ecc pools the o into the right voxel (because it is nearer to the superpoint in the right voxel).
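
The picture above can be reproduced with a few lines of plain Python. This is only a sketch, assuming 2D points and a voxel size of 1.0; the helper names (`voxel_ids`, `superpoints`, `nearest_ids`) are made up for illustration and are not PyG or ecc functions.

```python
def voxel_ids(points, size):
    """Voxel affiliation: the grid cell that contains each point."""
    return [(int(x // size), int(y // size)) for x, y in points]


def superpoints(points, ids):
    """Mean position of the points that fall into each voxel."""
    groups = {}
    for point, vid in zip(points, ids):
        groups.setdefault(vid, []).append(point)
    return {
        vid: tuple(sum(coord) / len(members) for coord in zip(*members))
        for vid, members in groups.items()
    }


def nearest_ids(points, supers):
    """Nearest-neighbor assignment: the voxel whose superpoint is closest."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(supers, key=lambda vid: sq_dist(p, supers[vid])) for p in points]


# Three points in the top-left corner, one outlier near the right edge of the
# left voxel, and one point in the right voxel.
points = [(0.1, 0.9), (0.1, 0.9), (0.1, 0.9), (0.95, 0.1), (1.1, 0.5)]

by_voxel = voxel_ids(points, size=1.0)
by_nearest = nearest_ids(points, superpoints(points, by_voxel))

print(by_voxel[3])    # voxel affiliation of the outlier
print(by_nearest[3])  # nearest-superpoint assignment of the outlier
```

With these coordinates the outlier (index 3) lands in the left voxel `(0, 0)` by affiliation, but its nearest superpoint belongs to the right voxel `(1, 0)`.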

dhorka commented 5 years ago

Oh I see!! Then for each point you find the nearest "superpoint" right?

rusty1s commented 5 years ago

Yes, we basically create a new cluster vector based on the superpoint positions and the initial point cloud.

dhorka commented 5 years ago

Then, if I understood properly, the whole process is:

You create the voxels using voxel_grid. After that, you estimate the "super point" for each voxel, and then you assign each point to its nearest super point. Right?

What I am missing at the moment is the purpose of the function consecutive_cluster(cluster).

rusty1s commented 5 years ago

Yes. The voxel_grid method creates non-consecutive cluster ids, e.g., [1, 5, 1, 5, 10]. The consecutive_cluster method redefines cluster ids to [0, 1, 0, 1, 2] so we can use basic scatter ops for pooling and coarsening.
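
The relabeling itself is easy to mimic in plain Python. The function below is a hypothetical stand-in written for illustration, not the PyG implementation; it only reproduces the id mapping from the example above.

```python
def consecutive_ids(cluster):
    """Relabel arbitrary cluster ids as 0..k-1, preserving their sorted order."""
    remap = {old: new for new, old in enumerate(sorted(set(cluster)))}
    return [remap[c] for c in cluster]


print(consecutive_ids([1, 5, 1, 5, 10]))  # [0, 1, 0, 1, 2]
```

Once the ids are consecutive, they can be used directly as row indices into an output with one row per cluster, which is what the scatter ops rely on.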

dhorka commented 5 years ago

And what is the value of the perm variable? consecutive_cluster returns two values.

rusty1s commented 5 years ago

It has quite a strange name :D It holds one original idx for each cluster idx and is therefore used for filtering the batch indices.
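
A rough plain-Python picture of what `perm` holds, under the assumption that every cluster lies inside a single graph (so any member point carries the right batch index). `relabel_with_perm` is a hypothetical helper, not the PyG function, and which member it picks per cluster is an arbitrary choice of this sketch.

```python
def relabel_with_perm(cluster):
    """Return consecutive ids and, for each new id, the index of one member point."""
    remap = {old: new for new, old in enumerate(sorted(set(cluster)))}
    new_ids = [remap[c] for c in cluster]
    perm = [0] * len(remap)
    for point_idx, cid in enumerate(new_ids):
        perm[cid] = point_idx  # any member works; here the last one wins
    return new_ids, perm


cluster = [1, 5, 1, 12, 12]  # non-consecutive ids for five points
batch = [0, 0, 0, 1, 1]      # graph index of each point

new_ids, perm = relabel_with_perm(cluster)
new_batch = [batch[i] for i in perm]  # one batch entry per cluster

print(new_ids)    # [0, 1, 0, 2, 2]
print(perm)       # [2, 1, 4]
print(new_batch)  # [0, 0, 1]
```

Indexing `batch` with `perm` therefore shrinks the per-point batch vector to a per-cluster one, which mirrors `new_batch = data.batch[perm]` in the snippet earlier in the thread.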

dhorka commented 5 years ago

I see, thanks!! I think we can close the issue =D One last question: can I find the latest versions of torch_cluster and torch_geometric in the pip repositories?

rusty1s commented 5 years ago

Cool :) torch-cluster is released on PyPI, and PyG will follow soon.

dhorka commented 5 years ago

Hi, I am posting this problem here because I am not sure whether it is related. Everything we did here was working perfectly on the test that I provided. However, in the training phase I get this error after 30 epochs:

  File "/home/venv/graph/lib/python3.6/site-packages/torch_geometric-1.3.0-py3.6.egg/torch_geometric/nn/glob/glob.py", line 47, in global_mean_pool                                                                                                          
    # Patterns ending with a slash should match only directories
  File "/home/venv/graph/lib/python3.6/site-packages/torch_geometric-1.3.0-py3.6.egg/torch_geometric/utils/scatter.py", line 28, in scatter_                                                                                                                 
  File "/home/venv/graph/lib/python3.6/site-packages/torch_scatter/mean.py", line 68, in scatter_mean
    out = scatter_add(src, index, dim, out, dim_size, fill_value)
  File "/home/venv/graph/lib/python3.6/site-packages/torch_scatter/add.py", line 72, in scatter_add
    src, out, index, dim = gen(src, index, dim, out, dim_size, fill_value)
  File "/home/venv/graph/lib/python3.6/site-packages/torch_scatter/utils/gen.py", line 17, in gen
    index = index.view(index_size).expand_as(src)
RuntimeError: shape '[5543, 1]' is invalid for input of size 5544
srun: error: gpic10: task 0: Exited with exit code 1
=====================================================

During training I apply data augmentation that includes a dropout of points, which means that each epoch sees graphs of different sizes, and graphs within the same batch also differ in size. I do not get this error if I use max_pool and the clustering with the default start and end provided in PyG. Moreover, the error is triggered in global_mean_pool, which seems weird to me. Do you have any clue what could be happening?

rusty1s commented 5 years ago

Try adding the size argument to the global_mean_pool op.

dhorka commented 5 years ago

I tried, but it does not solve the issue.

dhorka commented 5 years ago

Moreover, I ran an experiment that uses max_pool + cluster with the new start and end, and global_mean_pool works properly. I think the issue is related to the new code used to calculate the max_pool.

rusty1s commented 5 years ago
data.x = scatter_('max', data.x, cluster, dim_size=new_pos.size(0))
dhorka commented 5 years ago

Oh I see, that solves the problem. But I do not understand why...

rusty1s commented 5 years ago

It seems that in rare cases, the clusters newly computed via nearest neighbor result in a different number of clusters, e.g., there can be empty clusters, hence new_x and new_pos have different shapes.
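
The shape mismatch can be illustrated without any library. The `scatter_max` below is a simplified pure-Python stand-in (an assumption made for this sketch, not the torch_scatter code): it infers the output size from the largest index unless `dim_size` is given, and fills groups that received no rows with zeros.

```python
def scatter_max(src, index, dim_size=None):
    """Row-wise max of `src` grouped by `index`; empty groups become zero rows."""
    size = max(index) + 1 if dim_size is None else dim_size
    width = len(src[0])
    out = [None] * size
    for row, i in zip(src, index):
        if out[i] is None:
            out[i] = list(row)
        else:
            out[i] = [max(a, b) for a, b in zip(out[i], row)]
    return [row if row is not None else [0.0] * width for row in out]


new_pos = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # three superpoints
cluster = [0, 1, 0, 1]                          # nearest search never picks superpoint 2
x = [[1.0, 4.0], [2.0, 3.0], [5.0, 0.0], [0.0, 6.0]]

inferred = scatter_max(x, cluster)
padded = scatter_max(x, cluster, dim_size=len(new_pos))

print(len(inferred), len(new_pos))  # 2 3 -> features and positions disagree
print(len(padded), len(new_pos))    # 3 3 -> shapes agree again
print(padded[2])                    # [0.0, 0.0] -> the empty cluster's zero row
```

Passing the size explicitly keeps the feature matrix aligned with the positions, at the cost of a zero feature row for the superpoint that attracted no points.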

dhorka commented 5 years ago

> It seems that in rare cases, newly computed clusters via nearest neighbor result in a different amount of clusters produced, e.g., there can be empty clusters, hence new_x and new_pos have a different shape.

Can we know which cluster is empty? I mean, it would be better to suppress the empty cluster. Is it always the last one?

rusty1s commented 5 years ago

No, it's not; that's just the case when it crashes. You cannot suppress empty clusters, because officially they are there (they have a point in space), just with a zero representation.

dhorka commented 5 years ago

But if no previous node is assigned to this cluster (by the nearest algorithm, I mean), this new point doesn't have any features, and I don't think it is useful to have this node in my graph. How can I know which node has no representation? As far as I understood, dim_size only adds zero padding to the end of the feature vector, right?

rusty1s commented 5 years ago

I guess you need to recompute pos_new and batch_new (untested) based on the new cluster:

cluster = nearest(data.pos, new_pos, data.batch, new_batch)
data.x = scatter_('max', data.x, cluster)
data.pos = scatter_('mean', data.pos, cluster)
data.batch = torch.scatter(0, cluster, data.batch) 
dhorka commented 5 years ago

The calculation of the new data.batch is not working:

TypeError: scatter() received an invalid combination of arguments - got (int, Tensor, Tensor), but expected one of:

rusty1s commented 5 years ago
data.batch.new_empty(data.pos.size(0)).scatter_(0, cluster, data.batch)

should do the trick.

dhorka commented 5 years ago
cluster size:,  torch.Size([993])
new batch size:  torch.Size([993])
x size:  torch.Size([257, 32])
new pos size:  torch.Size([257, 3])

These are the sizes of each tensor. As you can see, new_batch is not properly generated. This is because the cluster size is 993.

rusty1s commented 5 years ago

The new data.batch has the shape of data.pos.size(0), which should be 257.

dhorka commented 5 years ago
def forward(self, data):
        cluster = voxel_grid(data.pos, data.batch, self.pool_rad, 
                            start=data.pos.min(dim=0)[0] - self.pool_rad * 0.5, 
                            end=data.pos.max(dim=0)[0] + self.pool_rad * 0.5)

        cluster, perm = consecutive_cluster(cluster)

        new_pos = scatter_('mean', data.pos, cluster)
        new_batch = data.batch[perm]

        cluster = nearest(data.pos, new_pos, data.batch, new_batch)
        data.x = scatter_('max', data.x, cluster)
        data.pos = scatter_('mean', data.pos, cluster)
        data.batch.new_empty(data.pos.size(0)).scatter_(0, cluster, data.batch)                                                                                                                                                                                                     

        print("cluster size:, ", cluster.size())
        print("new batch size: ",data.batch.size())
        print(" x size: ",data.x.size())
        print(" new pos size: ",data.pos.size())
        #data.pos = new_pos
        #data.batch = new_batch
        data.edge_attr = None

        #data.edge_attr = None
        #data = max_pool(cluster, data)
        data = self.graph(data)
        return data

This is the code that I am using, but I get a new batch with size 993.

rusty1s commented 5 years ago

data.batch = data.batch.new_empty(data.pos.size(0)).scatter_(0, cluster, data.batch) :D

dhorka commented 5 years ago

Oh! You are completely right xDD After that I was getting this error: RuntimeError: invalid argument 3: Index tensor must not have larger size than input tensor, but got index [993] input [257]

I modified the code: data.batch = data.batch.new_empty(data.pos.size(0)).scatter_(0, torch.unique(cluster), data.batch) And then it seems to work. But I get this error with radius_graph:

row, col = radius(x, x, r, batch, batch, max_num_neighbors + 1)
  File "/work/test_modelnet_ecc_pygeometric/env/pytorch11_geometric/lib/python3.6/site-packages/torch_cluster-1.4.2-py3.6-linux-x86_64.egg/torch_cluster/radius.py", line 61, in radius                                                                 
    max_num_neighbors)
RuntimeError: scan failed to synchronize: device-side assert triggered

Same error with KNN:

 File "work/test_modelnet_ecc_pygeometric/env/pytorch11_geometric/lib/python3.6/site-packages/torch_cluster-1.4.2-py3.6-linux-x86_64.egg/torch_cluster/knn.py", line 113, in knn_graph
    row, col = knn(x, x, k if loop else k + 1, batch, batch)
  File "/work/test_modelnet_ecc_pygeometric/env/pytorch11_geometric/lib/python3.6/site-packages/torch_cluster-1.4.2-py3.6-linux-x86_64.egg/torch_cluster/knn.py", line 57, in knn
    return torch_cluster.knn_cuda.knn(x, y, k, batch_x, batch_y)
RuntimeError: scan failed to synchronize: device-side assert triggered
rusty1s commented 5 years ago

No, you shouldn't use a unique call there. I do not understand the first error though, because before calling scatter_, data.batch and cluster should have equal shape.

dhorka commented 5 years ago

Sorry, it was a mistake in my code; I found it. Now I do not need torch.unique. But I am getting this error:


 File "work/test_modelnet_ecc_pygeometric/env/pytorch11_geometric/lib/python3.6/site-packages/torch_cluster-1.4.2-py3.6-linux-x86_64.egg/torch_cluster/knn.py", line 113, in knn_graph
    row, col = knn(x, x, k if loop else k + 1, batch, batch)
  File "/work/test_modelnet_ecc_pygeometric/env/pytorch11_geometric/lib/python3.6/site-packages/torch_cluster-1.4.2-py3.6-linux-x86_64.egg/torch_cluster/knn.py", line 57, in knn
    return torch_cluster.knn_cuda.knn(x, y, k, batch_x, batch_y)
RuntimeError: scan failed to synchronize: device-side assert triggered

The code used:

def forward(self, data):
        cluster = voxel_grid(data.pos, data.batch, self.pool_rad, 
                            start=data.pos.min(dim=0)[0] - self.pool_rad * 0.5, 
                            end=data.pos.max(dim=0)[0] + self.pool_rad * 0.5)                                                                                                                                                                                                       

        cluster, perm = consecutive_cluster(cluster)

        new_pos = scatter_('mean', data.pos, cluster)
        new_batch = data.batch[perm]

        cluster = nearest(data.pos, new_pos, data.batch, new_batch)
        data.x = scatter_('max', data.x, cluster)
        data.pos = scatter_('mean', data.pos, cluster)
        data.batch = data.batch.new_empty(data.pos.size(0)).scatter_(0, cluster, data.batch)

        print("cluster size:, ", cluster.size())
        print("new batch size: ",data.batch.size())
        print(" x size: ",data.x.size())
        print(" new pos size: ",data.pos.size())
        #data.pos = new_pos
        #data.batch = new_batch
        data.edge_attr = None

        #data.edge_attr = None
        #data = max_pool(cluster, data)
        data = self.graph(data)
        return data
rusty1s commented 5 years ago

Makes sense, you sadly need to call consecutive_cluster a second time due to possibly empty clusters :(

        cluster = voxel_grid(
            data.pos,
            data.batch,
            self.pool_rad,
            start=data.pos.min(dim=0)[0] - self.pool_rad * 0.5,
            end=data.pos.max(dim=0)[0] + self.pool_rad * 0.5)

        cluster, perm = consecutive_cluster(cluster)
        new_pos = scatter_('mean', data.pos, cluster)
        new_batch = data.batch[perm]

        cluster = nearest(data.pos, new_pos, data.batch, new_batch)
        cluster, perm = consecutive_cluster(cluster)
        data.x = scatter_('max', data.x, cluster)
        data.pos = scatter_('mean', data.pos, cluster)
        data.batch = data.batch[perm]
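
Why the second consecutive_cluster call helps can be pictured in plain Python. This is a sketch under stated assumptions: lists stand in for tensors, `None` stands in for the uninitialized entries that `new_empty` would leave behind, and `relabel_with_perm` and `scatter_into_empty` are hypothetical helpers, not PyG functions. Whether an unwritten batch entry is what triggered the device-side assert above is a plausible guess that the thread does not confirm.

```python
def relabel_with_perm(cluster):
    """Return consecutive ids and, for each new id, the index of one member point."""
    remap = {old: new for new, old in enumerate(sorted(set(cluster)))}
    new_ids = [remap[c] for c in cluster]
    perm = [0] * len(remap)
    for point_idx, cid in enumerate(new_ids):
        perm[cid] = point_idx
    return new_ids, perm


def scatter_into_empty(cluster, values, size):
    """Write values[i] into slot cluster[i] of a buffer that starts out unset."""
    out = [None] * size
    for c, v in zip(cluster, values):
        out[c] = v
    return out


cluster = [0, 2, 0, 2, 3]  # after the nearest search: superpoint 1 got no points
batch = [0, 0, 0, 0, 1]

direct = scatter_into_empty(cluster, batch, size=4)
new_ids, perm = relabel_with_perm(cluster)
relabeled = [batch[i] for i in perm]

print(direct)     # [0, None, 0, 1] -> slot 1 is never written
print(new_ids)    # [0, 1, 0, 1, 2] -> the empty id has disappeared
print(relabeled)  # [0, 0, 1]       -> one valid batch entry per remaining cluster
```

After relabeling, every remaining cluster has at least one member, so features, positions, and batch indices computed from the new ids all have the same length.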
dhorka commented 5 years ago

Hmmm, I think I am not understanding the consecutive_cluster function properly. The thing is, why is it not the same to put the consecutive_cluster call in between, like this:

cluster = nearest(data.pos, new_pos, data.batch, new_batch)
#cluster, perm = consecutive_cluster(cluster)
data.x = scatter_('max', data.x, cluster)
cluster, perm = consecutive_cluster(cluster)
data.pos = scatter_('mean', data.pos, cluster)
data.batch = data.batch[perm]

I am asking because I was thinking of doing this extra calculation only when data.x.size(0) != new_pos.size(0).

rusty1s commented 5 years ago

Mh, that doesn't make sense. Either recompute node positions based on the new cluster assignments, or allow zero feature representations for nodes. The thing is, the size mismatch is not a good condition for detecting empty clusters, because it only detects a single empty cluster (the one at the end), and nothing else.
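
The limitation of the size check can be seen with two small assignment vectors. The sketch below is plain Python; `empty_clusters` is a hypothetical helper introduced only for this illustration.

```python
def empty_clusters(cluster, num_clusters):
    """Ids in range(num_clusters) that no point was assigned to."""
    used = set(cluster)
    return [c for c in range(num_clusters) if c not in used]


num_clusters = 3          # number of superpoints before the nearest search
trailing = [0, 1, 0, 1]   # cluster 2 is empty: max id + 1 == 2, mismatch visible
middle = [0, 2, 0, 2]     # cluster 1 is empty: max id + 1 == 3, no mismatch

for assignment in (trailing, middle):
    size_mismatch = max(assignment) + 1 != num_clusters
    print(size_mismatch, empty_clusters(assignment, num_clusters))
```

The first case is caught by the size comparison, while the second passes it even though one cluster is empty, which is why counting members (or always relabeling) is the more reliable check.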

justanhduc commented 4 years ago

@dhorka @rusty1s could you please send me the script to reproduce ecc with torch_geometric?