jnhwkim / cbp

Multimodal Compact Bilinear Pooling for Torch7

x and y must be of same length? #2

Closed · lichengunc closed this issue 8 years ago

lichengunc commented 8 years ago

Does the current version only support bilinear pooling of two same-length vectors? Is it possible to support pooling of (n, dimA) and (n, dimB) --> (n, compact_dimension)?

chingyaoc commented 8 years ago

Hi, I modified the code to make it support different input dimensions, check it out! https://github.com/JamesChuanggg/cbp
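For anyone landing here later, a rough usage sketch of the different-dimension case. The constructor name and signature below follow the usual nn convention and are assumptions, not taken from the fork; check its README for the actual API.

```lua
require 'nn'
-- plus the require line given in the fork's README (not reproduced here)

-- Hypothetical sketch: pool a (n, dimA) batch with a (n, dimB) batch into a
-- shared compact dimension d. Constructor name/signature is assumed.
local n, dimA, dimB, d = 32, 2048, 1024, 16000
local cbp = nn.CompactBilinearPooling(d)   -- assumed: output dimension as the only argument
local x = torch.randn(n, dimA)
local y = torch.randn(n, dimB)
local z = cbp:forward({x, y})              -- expected shape: (n, d)
```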

lichengunc commented 8 years ago

That's so fast. Thank you very much.

lichengunc commented 8 years ago

Just a side question: it seems the outer product (e.g., bilinear pooling) is better than vector concatenation. But how does it compare with element-wise vector multiplication? Both are ways to let two vectors interact, and both appear in some recent papers. One thing for sure is that, in terms of dimensionality, the element-wise operation is better. Do you have any suggestions on how the two compare?

jnhwkim commented 8 years ago

@JamesChuanggg great! @lichengunc Fukui et al. (2016) compare those two operations on the visual question answering task, and they show that compact bilinear pooling is superior. In general, however, I think it is still an open question.

iamaaditya commented 8 years ago

@lichengunc It is certainly the case that the outer product is better than element-wise multiplication, simply because the outer product is more expressive. Think of this simple example, where we have only three features each.

Image feature = [a b c]
Question feature = [x y z]

When you do element-wise multiplication, you get three features like this ---

[ax by cz]

So you have three features, but each one is on its own scale; in another case where, say, 'x' were multiplied with some other feature 'm' instead of 'a', it would give a different number. This is good, and better than concatenation since you get interaction values, but see how it compares to the outer product:

The outer product [a b c] ⊗ [x y z] gives

[ax ay az]
[bx by bz]
[cx cy cz]

Here you are not only saying that feature 'a' should be combined with 'x'; you are saying that every image feature should be combined with every question feature. This squares the number of features, which is certainly very good, as more features have more expressive power. But as you can see, this comes with a cost: if you had 1000 features each, you would now have 1000 * 1000 features, which is very hard to manage even with the latest hardware.
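To make that size difference concrete, here is a minimal Torch7 sketch of the two operations on the three-feature example above (the numbers simply stand in for a, b, c and x, y, z):

```lua
require 'torch'

local img = torch.Tensor({1, 2, 3})   -- stands in for [a b c]
local qst = torch.Tensor({4, 5, 6})   -- stands in for [x y z]

-- Element-wise multiplication: keeps the dimension (3 features).
local ew = torch.cmul(img, qst)       -- {1*4, 2*5, 3*6} = {4, 10, 18}

-- Outer product: every image feature interacts with every question feature
-- (3 x 3 = 9 features). With 1000-dim inputs this would be 1000 x 1000.
local op = torch.ger(img, qst)        -- 3x3 matrix [ax ay az; bx by bz; cx cy cz]

print(ew:size())   -- size 3
print(op:size())   -- size 3x3 = 9 features
```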

The ingenuity of the paper quoted by @jnhwkim is that they figured out how to use compact bilinear pooling (a technique introduced in another paper) to reduce the size of the outer product.

HTH

lichengunc commented 8 years ago

Thank you guys for the detailed discussion! :-)

jnhwkim commented 8 years ago

@iamaaditya Hi there! How about a practical viewpoint? It may be another perspective, but I am of the opinion that the bilinear model may introduce overfitting, since it assumes full interactions between features. Gradient flows may diverge among them. Though I'm not questioning the recent successful papers, I want to understand this in a practical way.

tejaskhot commented 8 years ago

@jnhwkim Could you please explain, with an example, why it could overfit, along with your comment on the diverging gradient flows?

jnhwkim commented 8 years ago

@tejaskhot Sorry for the insufficient explanation. The bilinear model considers interactions among features, so the number of hidden parameters is quadratic, even if compact bilinear pooling is applied. This may be one of the reasons it overfits without appropriate regularization (L2 normalization, Layer Normalization, or SignedSquareRoot may be possible solutions).
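As a reference point (plain tensor ops, not tied to this repo's modules), a minimal sketch of the signed square root followed by L2 normalization mentioned above:

```lua
require 'torch'

-- Signed square root: y = sign(x) * sqrt(|x|), which damps large activations.
local function signed_sqrt(x)
  return torch.sign(x):cmul(torch.sqrt(torch.abs(x)))
end

-- L2 normalization: scale the vector to (roughly) unit length.
local function l2_normalize(x, eps)
  eps = eps or 1e-12
  return x / (x:norm() + eps)
end

local z = torch.randn(16000)          -- e.g. a compact bilinear feature
local z_reg = l2_normalize(signed_sqrt(z))
```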

Let x = (x1, x2) and y = (y1, y2) be two inputs. The outer product of these two vectors, flattened, gives (x1y1, x1y2, x2y1, x2y2). Summing the partial derivatives, the gradient of this with respect to x is (y1+y2, y1+y2) and with respect to y is (x1+x2, x1+x2). The output gradient d = (d1, d2, d3, d4), which is back-propagated from the next layer or the loss function, reaches the inputs as (y1d1+y2d2, y1d3+y2d4) for x and (x1d1+x2d3, x1d2+x2d4) for y. Eventually, each element of x is updated using all elements of y together with a part of d whose dimension is the same as y's.
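A quick numeric check of this in Torch7, using the fact that if D is the output gradient d reshaped to the 2x2 layout of the outer product, then grad_x = D y and grad_y = D^T x (the values below are arbitrary):

```lua
require 'torch'

local x = torch.Tensor({1, 2})                -- (x1, x2)
local y = torch.Tensor({3, 4})                -- (y1, y2)
local d = torch.Tensor({0.1, 0.2, 0.3, 0.4})  -- output gradient (d1, d2, d3, d4)

-- Forward: flattened outer product (x1y1, x1y2, x2y1, x2y2) = (3, 4, 6, 8)
local o = torch.ger(x, y):view(-1)

-- Backward: reshape d to the 2x2 layout of the outer product, then
-- grad_x = D * y   -> (y1*d1 + y2*d2, y1*d3 + y2*d4)
-- grad_y = D^T * x -> (x1*d1 + x2*d3, x1*d2 + x2*d4)
local D = d:view(2, 2)
local grad_x = D * y          -- matrix-vector product
local grad_y = D:t() * x

print(grad_x)   -- {3*0.1 + 4*0.2, 3*0.3 + 4*0.4} = {1.1, 2.5}
print(grad_y)   -- {1*0.1 + 2*0.3, 1*0.2 + 2*0.4} = {0.7, 1.0}
```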

Though I have tried to describe this carefully, please let me know if I missed something.

iamaaditya commented 8 years ago

@jnhwkim Sorry for late reply.

You are right in saying that the quadratic explosion in the number of features will certainly lead to overfitting if proper measures are not taken. However, consider two things:

  1. It is better to first take all the features and then use only some of them if overfitting is an issue, rather than starting with a smaller number of features. One way would be to use PCA or Self-Organizing Maps.
  2. In the Berkeley paper, they use a Count Sketch projection, which reduces the dimension of the features from R^n to R^d, where 'd' is significantly smaller than 'n'. This allows them to avoid overfitting (see the sketch after this list).
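To make point 2 concrete, here is a minimal sketch of a Count Sketch projection from R^n to R^d using random signs and a random hash of the indices. The actual method also convolves the two sketches via FFT, which is not shown; names below are illustrative only.

```lua
require 'torch'

-- Minimal Count Sketch projection: map x in R^n to R^d using a fixed random
-- hash h: {1..n} -> {1..d} and random signs s in {-1, +1}.
local function count_sketch(x, h, s, d)
  local y = torch.zeros(d)
  for i = 1, x:size(1) do
    y[h[i]] = y[h[i]] + s[i] * x[i]   -- scatter each input with its sign
  end
  return y
end

local n, d = 2048, 16000
local h = torch.rand(n):mul(d):long():add(1)              -- random hash indices in {1..d}
local s = torch.rand(n):gt(0.5):double():mul(2):add(-1)   -- random signs {-1, +1}
local x = torch.randn(n)
local psi_x = count_sketch(x, h, s, d)                    -- d-dim sketch of x
```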

@tejaskhot

Overfitting occurs whenever the number of features increases without a proportional increase in the amount of training data. If one considers simple bilinear pooling, then a standard pair of 2048-dim features (the feature size of GoogLeNet/ResNet) is converted to roughly 4 million interaction features, and the classifier on top of them needs on the order of 12.5 billion parameters. However, the dataset has only about 1 million entries. Thus each feature could just memorize one example from the input; the model would achieve 100% accuracy in training without the ability to generalize.
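For the record, the back-of-the-envelope arithmetic behind those sizes (assuming a 3000-way answer classifier, as in the paper's example):

```lua
-- Rough sizes for full bilinear pooling of two 2048-dim vectors.
local n = 2048
local bilinear_features = n * n                       -- 4,194,304 (~4.2 million) interaction features
local classifier_params = bilinear_features * 3000    -- ~12.5 billion weights for a 3000-way classifier
print(bilinear_features, classifier_params)           -- 4194304   12582912000
```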

In the Berkeley paper they use 'compact bilinear' pooling, which reduces the number of features: they use a smaller size 'd' instead of 'n' (full bilinear). It is interesting to note that with d=32K accuracy is lower than with d=16K, which tells you that overfitting starts at a very low number of features (low compared with the full bilinear size). One can only imagine that if d were 1 billion (or the full bilinear size), the results would be terrible.