snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Separate embedding parameters from model parameters? #61

Closed Chillee closed 4 years ago

Chillee commented 4 years ago

I understand that the line between embedding and model parameters could potentially get somewhat fuzzy, but I think the current state of "parameters" is somewhat confusing.

The original request (#39) was for "system metrics" - in other words, to get a sense of 1. how complex/expressive the model is, and 2. how long it takes to train. Presumably for ease of reporting, OGB eschewed metrics like runtime in favor of parameter counts + hardware.

However, in the case of embeddings, I don't think the current parameters metric corresponds to anything meaningful. For example, on Arxiv, the node2vec embedding has 21.8 million parameters, in contrast to GCNII, which has 2 million. I don't think this accurately represents any notion of either complexity or runtime.
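
(To make that count concrete: node2vec learns one free vector per node, so the embedding table alone accounts for essentially all of it. A minimal sketch - the dimension below is an assumption, not taken from the leaderboard entry:)

```python
import torch

num_nodes = 169_343   # nodes in ogbn-arxiv
emb_dim = 128         # assumed; a common choice for node2vec embeddings

# One learned vector per node: the table alone dominates the parameter count.
emb = torch.nn.Embedding(num_nodes, emb_dim)
print(sum(p.numel() for p in emb.parameters()))  # 21,675,904, i.e. ~21.7M
```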

Things get even more confusing when you're stacking models on top of the embedding parameters. For example, on many of the link prediction tasks, the majority of the parameters go towards the embeddings, with relatively few going towards the model. This means that on leaderboards like DDI, one can increase the model complexity significantly without increasing the total number of parameters by much. It also means that one can, for example, double the number of model parameters while halving the number of embedding parameters, and end up showing a significant drop in overall parameters.

In summary, I think that in the presence of embeddings, the current "parameters" metric doesn't meaningfully capture the aspects of the method we care about. The simplest proposal I can think of is to report the embedding parameters separately from the model's parameters (if applicable), e.g. displayed as 123,456 (1,234,567).
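
(The split is trivial to compute when the embedding lives in its own module. A rough sketch - all sizes here are hypothetical, loosely modeled on a DDI-style setup:)

```python
import torch

# Hypothetical sizes: ogbl-ddi has 4,267 nodes; 256-dim embeddings assumed.
emb = torch.nn.Embedding(4267, 256)
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

model_params = sum(p.numel() for p in model.parameters())
emb_params = sum(p.numel() for p in emb.parameters())
print(f"{model_params:,} ({emb_params:,})")  # model params (embedding params)
```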

weihua916 commented 4 years ago

Thank you for your insight and suggestion. We will discuss this internally and get back to you.

weihua916 commented 4 years ago

After internal discussion, although we think your idea is interesting, we have decided to keep it the way it is to maintain simplicity.

Chillee commented 4 years ago

I can understand that sentiment. In that case, can you clarify what a "parameter" is? In particular, I've been wondering why SGC doesn't count as parameters while node2vec does. While it may be clearer in the case of SGC (as it is simply an exponentiated adjacency matrix), it becomes less clear as you move to using, say, graph diffusions as embeddings, which can be formulated both iteratively and as the minima of an optimization problem.
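
(For concreteness, here's a rough sketch of my reading of the SGC precomputation - normalization details are assumed. The "embedding" is just repeated sparse multiplies of the normalized adjacency against the raw features, with nothing learned:)

```python
import numpy as np
import scipy.sparse as sp

def sgc_features(adj: sp.spmatrix, x: np.ndarray, k: int = 3) -> np.ndarray:
    """Compute S^k X (assumed symmetric normalization with self-loops)."""
    adj = adj + sp.eye(adj.shape[0])          # add self-loops
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(deg ** -0.5)
    s = d_inv_sqrt @ adj @ d_inv_sqrt         # normalized adjacency
    for _ in range(k):
        x = s @ x                             # sparse multiply; no parameters
    return x
```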

You can have similar "ideas" for models - for example, you can formulate a linear model (on raw features) as a sequence of matrix inverses/multiplies instead of gradient descent. In that case, however, the spirit of the rule seems clearer.
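
(A toy illustration of that reformulation, on random data - the same linear-model weights can come from a closed-form solve instead of gradient descent:)

```python
import numpy as np

x = np.random.randn(1000, 16)               # toy features (purely illustrative)
y = np.random.randn(1000)
w, *_ = np.linalg.lstsq(x, y, rcond=None)   # w = argmin_w ||Xw - y||^2, no SGD
```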

I hope I'm not being nitpicky with these questions - in my case, some submissions I would like to make involve graph diffusions (sometimes computed with gradient-based optimization, sometimes with iterative methods). Should I consider graph diffusions computed via optimization to be extra "parameters", while treating graph diffusions computed via iterative methods as something more similar to SGC?

weihua916 commented 4 years ago

I am not sure if I understand your method well, but if you can generate your embeddings without torch.nn.Embedding + a gradient-based optimizer, then it's likely that you do not need to count those in your number of parameters.

weihua916 commented 4 years ago

Practically, the question is whether the underlying parameter optimization problem is trivial or not. node2vec is highly non-trivial (thus, we include the embeddings as parameters), while SGC (if there is any underlying optimization objective) is trivial (thus, zero parameters).

Chillee commented 4 years ago

I think that's a fairly understandable rule, and I don't have a problem with it :+1: . I still have my concerns about the combined parameter count being an inaccurate indicator of model complexity, but I can see how the clarity of the leaderboard would outweigh those concerns.

Re: my point about diffusion being computable with both an iterative method and a gradient-based method, see Equations 1 and 3 of https://arxiv.org/abs/2006.04762.

Equation 1 is the common iterative method for computing diffusions, but Equation 3 is an equivalent optimizer-based formulation. In this case, one would have to take care to use the iterative method to reduce the parameter count.
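
(A sketch of the iterative version, in the notation I use further down in this thread - `A`, `y0`, and `alpha` are placeholders, and it's just repeated sparse multiplies with no per-node parameters:)

```python
import numpy as np
import scipy.sparse as sp

def diffuse(a: sp.spmatrix, y0: np.ndarray, alpha: float = 0.9,
            num_iters: int = 50) -> np.ndarray:
    """Iterate y_i = alpha*A*y_{i-1} + (1-alpha)*y_0 (Equation-1 style)."""
    y = y0.copy()
    for _ in range(num_iters):
        y = alpha * (a @ y) + (1 - alpha) * y0
    return y
```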

For some more context on why I asked the question, what spurred it was the use of spectral embeddings. To compute spectral embeddings, the primary work is done by calling out to a sparse eigensolver. However, my concern was that some sparse eigensolvers use conjugate-gradient-based methods. I was wondering whether I was in the position of needing to understand which sparse eigensolver the library I was using called out to, and to figure out which methods it was using.
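
(A minimal sketch of what I mean by spectral embeddings - one common recipe, with the normalization assumed. scipy's `eigsh` wraps ARPACK/Lanczos, while a backend like `lobpcg` is conjugate-gradient-flavored, which is exactly the ambiguity I'm describing:)

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_embedding(adj: sp.spmatrix, dim: int = 128) -> np.ndarray:
    """Bottom eigenvectors of the normalized Laplacian."""
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1)))  # guard deg 0
    lap = sp.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, vecs = eigsh(lap, k=dim, which='SM')  # ARPACK/Lanczos under the hood
    return vecs
```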

One final question: One can use a logistic regression model on many of these tasks. Considering that logistic regression can be solved without the use of gradient descent, would a model fit in that manner be considered to have 0 parameters? My opinion would be no, but just wanted to verify.
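
(By "without gradient descent" I mean something like the classic Newton/IRLS fit - a toy sketch, purely my own illustration, where the fitted weights are parameters either way:)

```python
import numpy as np

def irls_logreg(x: np.ndarray, y: np.ndarray, num_iters: int = 25) -> np.ndarray:
    """Fit logistic regression by Newton/IRLS steps, not gradient descent."""
    w = np.zeros(x.shape[1])
    for _ in range(num_iters):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))   # predicted probabilities
        s = p * (1.0 - p)                    # IRLS weights
        hess = x.T @ (s[:, None] * x) + 1e-6 * np.eye(x.shape[1])
        grad = x.T @ (y - p)
        w = w + np.linalg.solve(hess, grad)  # Newton step via a linear solve
    return w
```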

EDIT: You posted your most recent comment as I was writing this. I totally agree with your overall motivation - apologies if my questions sounded tricky for the sake of being tricky. There is indeed some grey area about whether a problem is "trivial", but I think that's completely acceptable and resolvable on a case-by-case basis.

weihua916 commented 4 years ago

Great. I think the exact calculation of spectral embeddings is quite computationally expensive and non-trivial, so they should be counted as parameters.

Chillee commented 4 years ago

haha, now I disagree with you again :rofl:. I think the underlying computational objective is conceptually very simple, and the primary work is being done by an eigensolver. In fact, if you want, you can compute your eigenvectors with something analogous to our method for diffusions as well - in the above iterative method for diffusion, `y_i = alpha*A*y_{i-1} + (1-alpha)*y_0`, we can simply set `alpha = 1.0` to obtain an eigenvector (this is just power iteration), and then use deflation to repeat and compute the other eigenvectors.
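
(A sketch of that observation: with no teleport term the update reduces to power iteration, and deflating out previously found eigenvectors lets you peel off the rest. `A` is an assumed symmetric matrix here:)

```python
import numpy as np

def top_eigvecs(a: np.ndarray, k: int = 4, num_iters: int = 100) -> np.ndarray:
    """Power iteration with deflation (assumes symmetric A)."""
    vecs = []
    for _ in range(k):
        y = np.random.randn(a.shape[0])
        for _ in range(num_iters):
            y = a @ y                        # alpha = 1.0: pure propagation
            for v in vecs:
                y -= (v @ y) * v             # project out found eigenvectors
            y /= np.linalg.norm(y)
        vecs.append(y)
    return np.stack(vecs, axis=1)
```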

I also thought that it wouldn't be counted under your previous statement:

> I am not sure if I understand your method well, but if you can generate your embeddings without torch.nn.Embedding + a gradient-based optimizer, then it's likely that you do not need to count those in your number of parameters.

I think "non-trivial" is a better judgement than non-computationally expensive, as well. For example, MLPs can be trained on papers100M before you can do the 3 matrix vector multiplies for SGC. That doesn't make MLPs more computationally expensive than

Finally, spectral embeddings are not that computationally expensive either. Computing one for papers100M can be done in about the time it takes to do 50 matrix-vector multiplies (which is what we normally do for diffusion).

weihua916 commented 4 years ago

I see your point, and it makes sense. After all, the number of parameters is tricky and may not really be a good metric, especially in node/link property prediction tasks. I think it is still good to have on the leaderboard, but people should be aware of the grey zone and not weigh it too heavily. If you want to claim your method is better, that is best demonstrated via runtime and memory consumption, if possible.