keras-team / keras

Deep Learning for humans
http://keras.io/

Dense layer doesn't flatten higher dimensional tensors #9813

Closed FirefoxMetzger closed 3 years ago

FirefoxMetzger commented 6 years ago

https://github.com/keras-team/keras/blob/aedad3986200b825d94f847d52bd6b81f0419a06/keras/layers/core.py#L776

The documentation of the dense layer claims to flatten the input if a tensor with rank > 2 is provided. However, what actually happens is that the dense layer picks the last dimension and computes the result element-wise along the remaining axes.

https://github.com/keras-team/keras/blob/aedad3986200b825d94f847d52bd6b81f0419a06/keras/layers/core.py#L858

You can verify this by comparing two models, one with a Flatten() layer and one without: https://gist.github.com/FirefoxMetzger/44e9e056e45c1a3cc8000ab8d6f2cebe

The first model only has 10 + bias = 11 trainable parameters (reusing the weights along the 1st input dimension). The second model has 10*10 + bias = 101 trainable parameters. The output shapes are also completely different. I would have expected the result to be unaffected by the Flatten() layer...
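A minimal sketch of that comparison (the input shape (10, 10) and the single output unit are assumed here to match the parameter counts from the gist):

```python
from keras.layers import Dense, Flatten, Input
from keras.models import Model

# Without Flatten: the kernel has shape (10, 1), so 10 + 1 (bias) = 11
# trainable parameters, and the output has shape (None, 10, 1).
inp = Input(shape=(10, 10))
model_a = Model(inp, Dense(1)(inp))
model_a.summary()

# With Flatten: the kernel has shape (100, 1), so 100 + 1 (bias) = 101
# trainable parameters, and the output has shape (None, 1).
model_b = Model(inp, Dense(1)(Flatten()(inp)))
model_b.summary()
```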

It might very well be that I am misunderstanding something. If so, kindly point out my mistake =)

gabrieldemarmiesse commented 6 years ago

You are right, the code isn't consistent with the documentation. Could you do a PR removing this line in the docs? (we can't change the code as it would change the behavior of existing code).

FirefoxMetzger commented 6 years ago

I've created a PR as suggested. Thanks for the reply =)

gabrieldemarmiesse commented 6 years ago

Happy to help.

FirefoxMetzger commented 6 years ago

The PR got declined because

The doc is correct. The dot product only occurs along the last dimension of the input (which corresponds, conceptually, to a dot product with a flattened version of the input).

Unfortunately I don't understand this explanation. I've asked in the PR, but I don't think anybody will respond there. Thus I am reopening this issue in the hopes that somebody can help me understand. Here is my question again to save you switching threads:

I guess my question would be what 'flatten' means exactly in this context? (It clearly has no correspondence with how flatten is used in the Flatten() layer and that is what is confusing me [and apparently at least one other person]).

gabrieldemarmiesse commented 6 years ago

I must admit that I'm as confused as you are, haha. I think what it means is that the dot product is done on the last dimension, whereas one might think it would always be done on the second dimension. But indeed, it's definitely not the "flatten" definition that I'm used to seeing in the Keras documentation.

Well, hopefully, other confused people will see this thread.

gabrieldemarmiesse commented 6 years ago

Actually, I think a small comment might get accepted, like adding

Which means that the shape of the kernel will be (input_shape[-1], units).

to the note.
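A quick sketch illustrating that statement (the input shape (5, 8) and units=3 are just assumed example values):

```python
from keras.layers import Dense, Input

# The kernel of a Dense layer depends only on the last input dimension.
x = Input(shape=(5, 8))
dense = Dense(units=3)
y = dense(x)
print(dense.kernel.shape)  # (8, 3), i.e. (input_shape[-1], units)
print(y.shape)             # (None, 5, 3): the leading axes pass through
```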

FirefoxMetzger commented 6 years ago

Actually, I think I get how it's meant. Doesn't mean it's correct, but I can see a picture that explains it :D

Understanding the dense layer as a contraction operator between a rank-m tensor A (the input) and a rank-1 tensor W (the weights), one has to specify the rank-index i along which the contraction operates. In this case we (arbitrarily?) choose the last dimension, which is what one would expect, but I don't see any particular reason why, since we could define it along any other rank-index, too.

Performing that contraction, we iterate over the dimensions of all the other ranks and compute a scalar product (dot product) between W and the "vector" obtained by fixing all but the last index of A. This results in a new output tensor O of rank m-1 (the i-th rank-index is now scalar, so we don't count it anymore).

Thinking about this in the same way we "unroll" convolution layers into matrix multiplications, you end up with a "flattened" tensor (which looks like a matrix of shape (prod(dims of remaining ranks) x dim of i)) that is multiplied with the vector W. My problem with this picture is that the "flattened" tensor is still a matrix (a rank-2 tensor), not something of rank 1, which is what one would expect when hearing "flattened".
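A small numpy sketch of that contraction picture, with assumed example shapes:

```python
import numpy as np

A = np.random.rand(2, 3, 4)   # rank-3 input, last dimension of size 4
W = np.random.rand(4)         # rank-1 weights matching the last dimension

# Contraction along the last axis: the output O has rank 2, shape (2, 3).
O = np.tensordot(A, W, axes=([-1], [0]))

# The "flattened" view: reshape A to (prod of remaining dims, 4), which is
# still a matrix, and multiply by W; same numbers, just reshaped.
O_flat = A.reshape(-1, 4) @ W
assert np.allclose(O, O_flat.reshape(2, 3))
```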

bethard commented 6 years ago

I just want to chime in that I was also confused by this documentation, as was this StackOverflow user: https://stackoverflow.com/questions/44611006/timedistributeddense-vs-dense-in-keras-same-number-of-parameters

If nothing else, I would really appreciate it if the note explicitly stated that "flatten" here means something different from the Flatten layer. Ideally, the documentation would give an example of an input of some shape and how that is "flattened" to produce the output shape.

rmanak commented 6 years ago

I am also confused by this. However, applying a dense layer D[k,l] (of shape (K, L)) to each of the temporal components of an input X[?,m,k] (of shape (?, M, K)) is mathematically identical to the matrix multiplication X * D. This is just a happy coincidence. For the TimeDistributed layer to work with an arbitrary layer, however, Keras needs a "for loop" implementation of this multiplication rather than a fully vectorized one.
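A sketch of that equivalence, with assumed values M=7, K=4, L=3 (both variants end up with a single kernel of shape (K, L) = (4, 3)):

```python
from keras.layers import Dense, Input, TimeDistributed
from keras.models import Model

x = Input(shape=(7, 4))

# Dense applied directly to the rank-3 input vs. the same Dense wrapped in
# TimeDistributed: the parameter counts are identical.
dense_model = Model(x, Dense(3)(x))
td_model = Model(x, TimeDistributed(Dense(3))(x))

print(dense_model.count_params())  # 4*3 + 3 = 15
print(td_model.count_params())     # 15 as well
```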

If the input were flattened to shape (?, M*K), the layer would need to have shape (M*K, L) and far more parameters. That does not "correspond, conceptually, to a dot product with a flattened version of the input"; it does correspond conceptually to a dot product with a flattened version in which there are in fact M different copies of the dense layer of shape (K, L), so the temporal components do not share weights. Perhaps that is what they meant by conceptual equivalency.