pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Issue reproducing the results of the original ecc implementation: pooling layer and conv layer give different results from the original implementation #331

Closed dhorka closed 5 years ago

dhorka commented 5 years ago

As I mentioned in #319, I have problems reproducing the ecc implementation using pytorch_geometric. I found some differences between the results obtained: first, the two convolution operations give different results even when using the same weights. Moreover, the results of the pooling layers also differ.

I created a test that checks these things. Basically, the script loads the same weights into both implementations. These weights are obtained from training a network using the ecc implementation. Below you can see the output of my test.

ECC Weights and PyGeometric weights are equal: True #I am only re-checking to be sure that both weights are equal before loading them into the models.
Loading weights 
Starting validation:
ecc features conv1:  (997, 16) #Shape of the output of first conv in ecc implementation
pygeometric features conv1:  (997, 16) #Shape of the output of first conv in pygeometric implementation
Max difference between features of first conv 2.549824
Output of ecc pooling:  (398, 32)
Output of PyGeometric pooling:  (385, 32)
Pygeomtric Acc:  41.51982378854625  Ecc accuracy:  63.65638766519823
Pygeomtric Loss:  2.435516586519023  Ecc Loss:  0.9878960176974138

As you can observe, this difference has an impact on the accuracy when using the same weights. You can find the source code here. One important thing: the data used for these tests is obtained from the original ecc code.
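For reference, a minimal sketch of the kind of check the script performs (the model and variable names here are hypothetical, not taken from the actual repo):

import torch

# Hypothetical handles to both implementations, already loaded with the
# same trained ecc weights.
ecc_out = ecc_model(inputs)  # original ecc implementation
pyg_out = pyg_model(data)    # pytorch_geometric implementation

# Element-wise maximum deviation between the two feature maps.
print('Max difference:', (ecc_out - pyg_out).abs().max().item())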

rusty1s commented 5 years ago

I will look into it. You could do me a huge favor by implementing a simple example to reproduce this. The current codebase is quite hard to analyze.

dhorka commented 5 years ago

The main code reproduces the output that I posted. Is that what you want? Maybe I can put some comments in the main code to make the analysis easier.

rusty1s commented 5 years ago

Well, I haven't looked at it in more detail, but it would be nice to just have a basic script that compares the output of just the convolution on a mini example like a triangle graph.

dhorka commented 5 years ago

At the moment, the comparison is done with a ModelNet point cloud. The thing is that the structure of the data used in the original ecc is not trivial...

The script I provided compares the output of the first convolution of both implementations and also the output of the first max pooling.

One important thing that I forgot to mention: you need to compile open3d and pytorch from source to avoid runtime problems, as described here. Moreover, the code provided is adapted to pytorch 1.1.

EDIT: I added a requirements file to the repo and also modified the ecc module to use JIT to compile the extension.

yorard123 commented 5 years ago

Is there any news on this topic? I am facing the same issue. The same ecc architecture gives me lower accuracy with the pyg implementation than with the original implementation...

rusty1s commented 5 years ago

Actually, the installation of open3d in combination with torch is a real damper and I haven't had time to install both from source yet. Looking at the source code alone, I see no reason for the convolutions to perform differently given the same input. Are you sure the input graphs are identical? Can you ensure that the ecc weights and the self.conv1.nn(data.edge_attr) call from PyTorch Geometric are the same (w.r.t. permutation-invariance for a given neighborhood)?
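A minimal sketch of that weight check, assuming you can extract the corresponding dynamically generated filter tensor from the original implementation (theta_ecc is a hypothetical name):

theta_pyg = model.conv1.nn(data.edge_attr)  # edge-conditioned filter weights in PyG
print('Max filter difference:', (theta_pyg - theta_ecc).abs().max().item())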

dhorka commented 5 years ago

Actually, the installation of open3d in combination with torch is a real damper and I haven't had time to install both from source yet.

As far as I remember I think building from source pytorch and installing from pip open3d is enough.

Looking at the source code alone, I see no reason for the convolutions to perform differently given the same input. Are you sure the input graphs are identical?

The parameters of the knn, the edge attributes and so on are the same, but to be honest I did not check the graph structure. If it is not the same, then is there a bug in the graph creation?

In the end, if you train a network using the ecc implementation and the pyg implementation, the first one always obtains better results than the second one. I observed this on several datasets, using the same seed, same parameters, etc. As a reminder, this issue is the second part of #319.

rusty1s commented 5 years ago

Well, you said the output of the first convolution yields different results, so it would be reasonable to check at which call the results start to differ. For example, knn graph generation is not necessarily deterministic (considering there are points with the same distance to other points). In addition, the implementations might differ in the k of knn, producing k+1 or k neighbors when adding self-loops. I would be very pleased if you could verify that, so I do not need to compile PyTorch from source.
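One way to verify the neighbor counts, sketched with torch_cluster.knn_graph (the loop flag controls self-loops; which row of edge_index holds the center node depends on the flow convention discussed below):

import torch
from torch_cluster import knn_graph

edge_index = knn_graph(pos, k=9, batch=batch, loop=True)
# Count entries per node index; this reveals whether adding self-loops
# yields k or k+1 neighbors per node.
print(torch.bincount(edge_index[0]))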

We can look at the different voxel grid implementations afterwards.

rusty1s commented 5 years ago

I managed to run it on my machine. A first analysis already yields some differences in the input representation.

I will dig deeper. However, the knn graph generation yields equivalent graphs.

rusty1s commented 5 years ago

After more investigation, edgecompaction=0 for the ECC model and conv1.flow = 'target_to_source' for the PyG model will yield equivalent results. This indicates that you need to swap the row and col vectors for the knn generation when using the default value of flow='source_to_target'.
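In code, the suggested swap amounts to something like this (a sketch; edge_index is whatever the knn generation returned):

import torch

row, col = edge_index
edge_index = torch.stack([col, row], dim=0)  # swap source and target nodes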

rusty1s commented 5 years ago

For the voxel grid pooling, the implementations use different starting vectors for voxels. PyG starts its voxels at data.pos.min(dim=0)[0] by default (which is (0,0,0) for ModelNet), but I am not sure what open3d is using. The documentation does not say a word about this.

rusty1s commented 5 years ago

So either transpose your edge indices in PyG or use a different flow direction. Doing either of these should yield similar results to the official implementation. I will add a remark to the knn documentation.

dhorka commented 5 years ago

Hi, I was trying to change the flow of nnconv and observed that the current implementation does not map the flow parameter through to the message passing class. So I created my own custom nnconv adding flow='target_to_source'. I re-ran the experiment, but unfortunately I am not obtaining the same result after the first conv layer. Am I missing something?

dhorka commented 5 years ago

Solved, I forgot to adapt the Cartesian coordinates. I am now obtaining a difference between both implementations of 4.7683716e-07, which I think is expected. I will check with more data and come back. In the meantime, would it be possible to add a flag to the knn method that transposes row and col? Then it would be possible to use the default Cartesian method, right?

The new values of accuracy and loss using these changes are:

Pygeomtric Acc:  61.89427312775331  Ecc accuracy:  63.65638766519823
Pygeomtric Loss:  1.0270526386830254  Ecc Loss:  0.9878960071943925

I suppose this difference is due to the difference between the open3d and pygeometric implementations, could that be?

EDIT: Confirmed, the discrepancy is due to the difference between the open3d and pygeometric implementations. These are the different output sizes of voxel_grid + max_pooling:

ecc features: (257, 32)
pygeometric features: (246, 32)

rusty1s commented 5 years ago

Yes, there are different outputs of the voxel grid implementations due to different starting vectors. I will investigate that further.

The flag on knn is a good idea. Will do this. Then you can use the right Cartesian, yes. In addition, it is reasonable to pass kwargs to the Message Passing class.

dhorka commented 5 years ago

Yes, there are different outputs of the voxel grid implementations due to different starting vectors. I will investigate that further.

I was checking the open3d code. I need to look in more detail, but I think this is the file we need to check: https://github.com/intel-isl/Open3D/blob/20d8cc6ae41ed7f74d7bd374b70eef8005e45b4d/src/Open3D/Geometry/DownSample.cpp#L247

The flag on knn is a good idea. Will do this. Then you can use the right Cartesian, yes. Thanks!

In addition, it is reasonable to pass kwargs to the Message Passing class. I agree with that.

rusty1s commented 5 years ago
Eigen::Vector3d voxel_min_bound = input.GetMinBound() - voxel_size3 * 0.5;
Eigen::Vector3d voxel_max_bound = input.GetMaxBound() + voxel_size3 * 0.5;

So instead of using start=pos.min(dim=0)[0], it uses start=pos.min(dim=0)[0] - size * 0.5. In your case, a simple start=-1.25 should do the trick :)

dhorka commented 5 years ago

In your case, a simple start=-1.25 should do the trick :)

Well, only in the first layer. I will try adding both start and end in the same way it is done in open3d and come back with the result. Tomorrow morning I will tell you the results :) (I do not have a GPU available at the moment.) Thank you so much for your help!

dhorka commented 5 years ago

I was checking on CPU, and I get this error when I use start=pos.min(dim=0)[0] - size * 0.5:

TypeError: add(): argument 'other' (position 1) must be Tensor, not list

I think the problem is in this line: https://github.com/rusty1s/pytorch_geometric/blob/9bb32e464600ad9a792c3b9d406ea8f46b5d2b48/torch_geometric/nn/pool/voxel_grid.py#L34

rusty1s commented 5 years ago

Use pos.min(dim=0)[0].tolist() instead. Will fix in master.
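With that fix, a call mirroring the open3d bounds could look roughly like this (a sketch, assuming the voxel_grid signature of this PyG version; size is your voxel size):

from torch_geometric.nn import voxel_grid

start = (pos.min(dim=0)[0] - size * 0.5).tolist()
end = (pos.max(dim=0)[0] + size * 0.5).tolist()
cluster = voxel_grid(pos, batch, size, start=start, end=end)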

rusty1s commented 5 years ago

Fixed in master. You can now pass flow to any MessagePassing operator. In addition, knn_graph and torch_geometric.transforms.KNNGraph now have flow arguments with default source_to_target. You need to update torch-cluster to 1.4.1. Please check if everything works as expected. I will craft a new PyG release if everything works :)
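Usage after the update might look like this (a sketch; in_channels, out_channels and mlp are placeholder values):

from torch_cluster import knn_graph
from torch_geometric.nn import NNConv

# flow now defaults to 'source_to_target' in knn_graph and can be
# forwarded as a kwarg to any MessagePassing operator, e.g. NNConv.
edge_index = knn_graph(pos, k=9, batch=batch, flow='source_to_target')
conv = NNConv(in_channels, out_channels, mlp, flow='source_to_target')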

dhorka commented 5 years ago

Yep, tomorrow I will update my setup and check everything. I managed to run voxel_grid with the new start and end, and these are the results:

ecc features conv3:  (398, 32)
pygeometric features conv3:  (398, 32)
Max difference between features of  conv3: 8.112797
Pygeomtric Acc:  62.334801762114544  Ecc accuracy:  63.65638766519823
Pygeomtric Loss:  1.0285567773858866  Ecc Loss:  0.9878960071943925

Conv3 is the convolution after the first max pooling using voxel_grid. We are close! Now there is the same number of points! But the difference is high in this layer. Tomorrow I will update my setup to use the new changes, and I will modify the code to obtain the output of the first max pooling so I can compare the results.

rusty1s commented 5 years ago

Well, checking the difference between feature maps after pooling is not well-defined, because the ordering of nodes may differ. We first need to sort the point clouds based on their positions (in both implementations) before taking the difference.

dhorka commented 5 years ago

Well, checking the difference between feature maps after pooling is not well-defined, because the ordering of nodes may differ. We first need to sort the point clouds based on their positions (in both implementations) before taking the difference.

You are right! Then maybe it is only slightly different due to numerical instability on GPUs, right? But I suppose we need to check to be sure before properly closing the issue.

I updated pygeometric and torch-cluster, and I think there is an error: target_to_source and source_to_target seem to be inverted. The first check I did was to run the previous code with the new version, using the adapted Cartesian transform and with the nnconv flow changed to target_to_source. By default knn has flow=source_to_target, which I assumed is the same as in the previous version. With that I obtain different (bad) results; if I change the knn to target_to_source, I obtain the same results as with the previous versions. Moreover, using default Cartesian coordinates, nnconv with source_to_target and knn with target_to_source gives bad results, but if I change knn to source_to_target it works again. I think source_to_target and target_to_source are inverted, could that be?

rusty1s commented 5 years ago

No, you need to use the default flag for knn, which is source_to_target (the one that matches that of the message passing layer).

dhorka commented 5 years ago

Oh I see, then I misunderstood something. If the default value is source_to_target, the default behaviour of knn is now different from the previous knn, right?

rusty1s commented 5 years ago

Yes, but it makes more sense this way and should prevent a lot of mistakes.

dhorka commented 5 years ago

Ok, now I understand.

I am trying to check whether the pooling layers obtain the same results, but I am having problems recovering the node positions from the original implementation...

rusty1s commented 5 years ago

Well, you can compare the mean or sum of positions/features in the node dimension. Alternatively, you can use something like sorting, e.g.,

idx = pos[:, 0] + 100 * pos[:, 1] + 10000 * pos[:, 2]  # combine coordinates into a (near-)unique scalar key per point
_, perm = idx.sort()  # permutation that orders points by that key
pos = pos[perm]
x = x[perm]

both for ecc and pyg positions/features separately.
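The simpler sum-based check from the first suggestion could then be (a sketch; x_ecc and x_pyg are the pooled feature maps of the two implementations):

# Node ordering no longer matters once we sum over the node dimension.
diff = (x_ecc.sum(dim=0) - x_pyg.sum(dim=0)).abs().max()
print('Feature sum difference:', diff.item())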

dhorka commented 5 years ago

I will do the first one; you are right that using the sum should be a good method to compare both outputs. I was wondering, are these changes also applied to the radius graph generator?

rusty1s commented 5 years ago

Yes, but it is not needed there, because the radius graph generation results in a symmetric adjacency matrix.

dhorka commented 5 years ago

After checking the outputs of max_pooling in both implementations, I see there are differences.

ecc features conv1: (993, 16)
pygeometric features conv1: (993, 16)
Max difference between features of conv1 4.7683716e-07
ecc features conv2: (993, 32)
pygeometric features conv2: (993, 32)
Max difference between features of conv2 3.5762787e-07
Output of ecc pooling: (257, 32)
Output of PyGeometric pooling: torch.Size([257, 32])
Sum max_pool ecc 2773.065 # sum of all features of the output of max pooling ecc
Sum max_pool pygeometric 1158.84 # sum of all features of the output of max pooling pyg
Max difference between features of max_pool1 1614.225 # difference

As you can see, there is a huge difference between the outputs of the max_pool layers. In contrast, the first two convs obtain similar results.

rusty1s commented 5 years ago

Can you do this for the point positions as well?

dhorka commented 5 years ago

I was trying to do it, but I am having trouble recovering the point positions of the ecc... I am not sure where this information is stored in the original implementation.

rusty1s commented 5 years ago

Ah, I see. Ok, I will look into it.

dhorka commented 5 years ago

I managed to recover the positions of the ecc implementation. They were lost in this method. These are the results:

ecc features conv1: (993, 16)
pygeometric features conv1: (993, 16)
Max difference between features of conv1 4.7683716e-07
ecc features conv2: (993, 32)
pygeometric features conv2: (993, 32)
Max difference between features of conv2 2.3841858e-07
Output of ecc pooling: (257, 32)
Output of PyGeometric pooling: torch.Size([257, 32])
Sum max_pool ecc 2773.065
Sum max_pool pygeometric 1158.84
Difference between features of max_pool1 1614.225
Size positions ecc: (257, 3)
Size positions pyg: (257, 3)
Sum positions ecc: 7565.66983967046
Sum positions pyg: 7565.66983967046
Difference positions: 0.0

I will push the new code that I used to obtain these results.

rusty1s commented 5 years ago

Interesting, thank you very much. So the error obviously lies in max pooling node features, but why? :(

dhorka commented 5 years ago

I do not know :( I am cleaning up the code a little (there are a lot of prints xD) before pushing it. In the morning I will try changing the max pool layer to an average pool and check whether we get the same behavior. I will notify you when I push the code.

rusty1s commented 5 years ago

Is this for CPU only? It might be good to test both, CPU and GPU.

dhorka commented 5 years ago

I'm testing on GPU. ECC on CPU is very slow.

dhorka commented 5 years ago

Confirmed, with avg_pool I am obtaining the same problems using the GPU:

Sum avg_pool ecc 2975.3457
Sum avg_pool pygeometric 1664.9924
Difference between features of avg_pool1 1310.3533
Size positions ecc: (398, 3)
Size positions pyg: (398, 3)
Sum positions ecc: 16834.863076248243
Sum positions pyg: 16834.863076248246
Difference positions: -3.637978807091713e-12

Maybe it is related to the flow? I am not sure whether the max and avg operations are affected by the flow direction.

rusty1s commented 5 years ago

Interesting, I'll look into this in a minute.

dhorka commented 5 years ago

Sorry for the delay, I had problems pushing the code to git, I do not know why :( Here you can find the updated code.

rusty1s commented 5 years ago

Thank you. For me personally, node features are not the same when using avg_pool and AGGR_MEAN respectively.

dhorka commented 5 years ago

What do you mean by using avg_pool and AGGR_MEAN?

rusty1s commented 5 years ago

I'm sorry, I misread your previous comment.

rusty1s commented 5 years ago

This seems like a bug in the ECC implementation.

dhorka commented 5 years ago

If we are sure about that, then from my side that is all. =)

rusty1s commented 5 years ago

Mh, the cluster indices still seem to be different.

dhorka commented 5 years ago

Uhmm, the positions are correct but the indices are wrong?