pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

GMMConv uses MessagePassing Layer with mean aggregation #248

Closed · bentaculum closed this issue 5 years ago

bentaculum commented 5 years ago

🐛 Bug

In the recently updated version of GMMConv, the layer inherits from its parent class MessagePassing and is initialized with mean aggregation.

Expected behavior

To my understanding, the aggregation function should be add, since the aggregation sums over the neighboring vertices; it does not normalize the Gaussian mixture components as stated in a comment in the code.
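To make the distinction concrete, here is a minimal sketch (a made-up toy layer, not the GMMConv implementation) showing how the aggr argument of MessagePassing changes the neighborhood reduction: with "add" the messages are summed over the neighbors, with "mean" they are averaged.

```python
import torch
from torch_geometric.nn import MessagePassing

class WeightedSum(MessagePassing):
    """Toy layer: weights each neighbor feature and reduces over the
    neighborhood with the chosen aggregation ('add' sums, 'mean' averages)."""
    def __init__(self, aggr='add'):
        super().__init__(aggr=aggr)

    def forward(self, x, edge_index, edge_weight):
        return self.propagate(edge_index, x=x, edge_weight=edge_weight)

    def message(self, x_j, edge_weight):
        # per-edge message: weighted neighbor feature
        return edge_weight.view(-1, 1) * x_j

x = torch.ones(3, 1)                          # 3 nodes, scalar feature 1
edge_index = torch.tensor([[1, 2], [0, 0]])   # two edges pointing into node 0
w = torch.ones(2)

print(WeightedSum('add')(x, edge_index, w)[0])   # tensor([2.]) – sums the two neighbors
print(WeightedSum('mean')(x, edge_index, w)[0])  # tensor([1.]) – averages them
```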

rusty1s commented 5 years ago

Yes, you are right. Those are two different things. Sorry for the inconvenience. Will fix it ASAP.

rusty1s commented 5 years ago

Hi, I just pushed a new GMMConv to master which fixes this bug. In addition, I added a root_weight argument to the operator to align better with other operators.

bentaculum commented 5 years ago

Hi again, thanks for getting to this so quickly :) I checked your update and have a few comments:

rusty1s commented 5 years ago

Hi, why shouldn't this make sense? In the end, we want to handle variable-sized neighborhoods, and this can only be achieved by normalization per node.

bentaculum commented 5 years ago

As far as I understand it, I would choose mean as the per-node aggregation function if I wanted to treat variable-sized neighborhoods the same way. As of now, I think you normalize all the weights per node in the message, but then sum up the dot products afterwards with add aggregation in propagate. I don't see this normalization in the MoNet paper. Again, I see the point of handling variable-sized neighborhoods identically, but for some other applications you might want the exact opposite, e.g. counting direct neighbors (or nodes n hops away) within a certain Euclidean distance.

bentaculum commented 5 years ago

So basically my question is: why do we need this "pre-normalization of the weights" per edge? I would say that the normalization by the number of edges can happen in the aggregation if desired, or not at all if I want to use sum aggregation instead.
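For illustration, a plain-PyTorch sketch of the two options under discussion (the graph, the weights, and the scatter_sum helper are made up): pre-normalizing the per-edge weights per target node and then summing always yields 1, so the neighborhood size is lost, whereas plain sum aggregation keeps it.

```python
import torch

# Toy per-edge weights (stand-ins for Gaussian mixture responses): three edges
# point into node 0, one edge into node 1. Values are made up for illustration.
w = torch.tensor([0.5, 0.5, 0.5, 0.5])
row = torch.tensor([0, 0, 0, 1])               # target node of each edge
num_nodes = 2

def scatter_sum(src, index, size):
    # sum the per-edge values into their target nodes
    return torch.zeros(size).index_add_(0, index, src)

# Sum aggregation without pre-normalization keeps the neighborhood size:
print(scatter_sum(w, row, num_nodes))           # tensor([1.5000, 0.5000])

# Pre-normalizing the weights per target node and then summing gives 1 for
# every node, regardless of how many neighbors it has, so the count is gone:
w_norm = w / scatter_sum(w, row, num_nodes)[row]
print(scatter_sum(w_norm, row, num_nodes))      # tensor([1., 1.])
```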

rusty1s commented 5 years ago

Ok, I agree. I changed GMMConv once again. You can now pass the desired aggr as an argument ("add" by default). This should align better with its formulation in the paper. WDYT?
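For reference, a hedged usage sketch with the arguments mentioned in this thread (the in_channels, out_channels, dim, and kernel_size values are made up, and defaults may have changed in later releases):

```python
import torch
from torch_geometric.nn import GMMConv

# Hypothetical sizes: 16-dim node features, 32-dim output,
# 2-dim pseudo-coordinates, 5 Gaussian kernels.
conv = GMMConv(in_channels=16, out_channels=32, dim=2, kernel_size=5,
               aggr='add',          # sum over the neighborhood, as in the paper
               root_weight=True)    # separate weight for the node itself

x = torch.randn(10, 16)                         # 10 nodes
edge_index = torch.randint(0, 10, (2, 40))      # 40 random edges
pseudo = torch.rand(40, 2)                      # pseudo-coordinates per edge

out = conv(x, edge_index, pseudo)               # shape [10, 32]
```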

bentaculum commented 5 years ago

Looks good to me :) I took a quick look at the implementation you reference and didn't really get why they are doing the normalization there. Do you understand it?

rusty1s commented 5 years ago

That implementation is the reason I was doing it in the first place, but after some thought I must agree with you: it does not really make sense. However, it ensures that features will get activated, which might help training. In the end, this operator does not train well at all, and I am not sure why. I guess it might be due to numerical instabilities or a poor initialization of the Gaussians. There are many operators that are just way easier to train.

bentaculum commented 5 years ago

That's what I've found as well: in particular, the means of the Gaussians have to be initialized in the right spots, otherwise gradient descent will quickly "push" them out of the feasible region of the pseudo-coordinates. I tried switching to your SplineConv, which I considered much more stable, but encountered unstable training as well. I don't understand what could go wrong with the splines; did you encounter problems due to, e.g., initialization at some point as well?
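One possible workaround, assuming the layer exposes its Gaussian parameters as conv.mu and conv.sigma (attribute names and shapes may differ between versions), is to re-initialize the means inside the range of the normalized pseudo-coordinates before training. A rough sketch:

```python
import torch
from torch_geometric.nn import GMMConv

conv = GMMConv(in_channels=16, out_channels=32, dim=2, kernel_size=5, aggr='add')

# Assumption: the Gaussian means/stds are exposed as `conv.mu` / `conv.sigma`;
# spread the means over the unit square so every kernel starts inside the
# range of normalized pseudo-coordinates.
with torch.no_grad():
    if hasattr(conv, 'mu'):
        conv.mu.uniform_(0.0, 1.0)
    if hasattr(conv, 'sigma'):
        conv.sigma.fill_(0.25)   # moderate initial spread (made-up value)
```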

rusty1s commented 5 years ago

What do you mean by unstable training? High fluctuations, or a single random drop in accuracy? I have noticed the latter from time to time, although I am absolutely unsure what causes this issue. For training SplineCNN, it's all about using correctly normalized pseudo-coordinates, due to the uniform distribution of the knot positions. That is exactly what KPConv is trying to fix.
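As a sketch of what that normalization looks like in practice (the toy graph below is made up), the Cartesian transform with norm=True rescales the relative positions to [0, 1], matching the uniformly spaced knots that SplineConv assumes:

```python
import torch
from torch_geometric.data import Data
import torch_geometric.transforms as T

# Toy graph: 3 points on a line with two directed edges.
pos = torch.tensor([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
edge_index = torch.tensor([[0, 1], [1, 2]])
data = Data(pos=pos, edge_index=edge_index)

# Cartesian(norm=True) stores the normalized relative positions as edge_attr.
data = T.Cartesian(norm=True, cat=False)(data)
print(data.edge_attr)   # one pseudo-coordinate per edge, each entry in [0, 1]
```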

bentaculum commented 5 years ago

I have built a simple toy task, different from most GNN benchmarks, for testing different graph neural network models. When using only a single graph convolution, I expect the learned kernel to be some sort of thresholding function on the pseudo-coordinates, so that the sum over the neighborhood can do the counting.
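One way such a toy task could be set up (the radius, point count, and threshold are made-up values): sample random points, build a radius graph, and label each node with the number of neighbors inside a smaller Euclidean distance, which is exactly what a thresholding kernel plus sum aggregation should be able to learn.

```python
import torch
from torch_geometric.nn import radius_graph

pos = torch.rand(100, 2)                                  # random 2D points
edge_index = radius_graph(pos, r=0.2)                     # candidate neighborhood
row, col = edge_index
dist = (pos[row] - pos[col]).norm(dim=-1)                 # per-edge distance
close = (dist < 0.1).float()                              # edges the kernel should "select"
y = torch.zeros(pos.size(0)).index_add_(0, row, close)    # target: neighbor count per node
```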

I was able to achieve an accuracy of 85% with MoNet (on ~6-10 classes), but this was very sensitive to the initialization of the Gaussians and to certain hyperparameters. I replaced MoNet with SplineConv, hoping it would be less prone to these problems, but could not even replicate the results I obtained with MoNet and was stuck at roughly 40% accuracy. I don't understand how SplineConv could be less powerful than MoNet.

rusty1s commented 5 years ago

This sounds strange. In practice, accuracy should be close to 100% on this simple toy task. Which pseudo-coordinates do you use? Distance? What is the accuracy on the training set?

bentaculum commented 5 years ago

Yes, I completely agree, both train and test accuracy should be close to 100%. Concerning your questions above:

Have you been able to successfully tackle tasks with GNNs that rely heavily on the pseudo-coordinates?

rusty1s commented 5 years ago

You should check out transforms like Distance or Polar, because Cartesian might not be the best choice for this task, given that it can only threshold a square, not a circle (due to the uniform knot positions). Our SplineCNN experiments rely almost solely on pseudo-coordinates; in the FAUST experiments, for example, there is no information other than the spatial relation of neighboring points.
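A small sketch of attaching those pseudo-coordinates (the toy graph is made up): Distance yields a single normalized radial coordinate, while Polar adds the angle, so a circular threshold becomes easy to express.

```python
import torch
from torch_geometric.data import Data
import torch_geometric.transforms as T

pos = torch.rand(10, 2)                         # random 2D positions
edge_index = torch.randint(0, 10, (2, 30))      # 30 random edges
data = Data(pos=pos, edge_index=edge_index)

dist_data = T.Distance(norm=True, cat=False)(data.clone())   # 1D: normalized distance
polar_data = T.Polar(norm=True, cat=False)(data.clone())     # 2D: distance and angle

print(dist_data.edge_attr.shape)    # torch.Size([30, 1])
print(polar_data.edge_attr.shape)   # torch.Size([30, 2])
```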