You are right, the code isn't consistent with the documentation. Could you do a PR removing this line in the docs? (we can't change the code as it would change the behavior of existing code).
I've created a PR as suggested. Thanks for the reply =)
Happy to help.
The PR got declined because
The doc is correct. The dot product only occurs along the last dimension of the input (which corresponds, conceptually, to a dot product with a flattened version of the input).
Unfortunately I don't understand this explanation. I've asked in the PR, but I don't think anybody will respond there. Thus I am reopening this issue in the hopes that somebody can help me understand. Here is my question again to save you switching threads:
I guess my question would be what 'flatten' means exactly in this context? (It clearly has no correspondence with how flatten is used in the Flatten() layer, and that is what is confusing me [and apparently at least one other person].)
I must admit that I'm as confused as you are, haha. I think what it means is that the dot product is done along the last dimension, whereas one might expect it to always be done along the second dimension. In any case, it's definitely not the definition of flatten that I'm used to seeing in the Keras documentation.
Well, hopefully, other confused people will see this thread.
Actually, I think a small comment might get accepted, like adding "Which means that the shape of the kernel will be (input_shape[-1], units)." to the note.
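For concreteness, a minimal sketch of what such a note would describe (assuming tf.keras; exact import paths may differ for standalone Keras):

```python
# Sketch: for a rank-3 input, the kernel shape is (input_shape[-1], units)
# and only the last axis of the input changes.
import tensorflow as tf

layer = tf.keras.layers.Dense(units=4)
y = layer(tf.zeros((2, 7, 3)))   # rank-3 input: (batch, time, features)

print(layer.kernel.shape)        # (3, 4) == (input_shape[-1], units)
print(y.shape)                   # (2, 7, 4) -- only the last axis changes
```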
Actually, I think I get how it's meant. Doesn't mean it's correct, but I can see a picture that explains it :D
Understanding the dense layer as a contraction operator between a rank-m tensor A (the input) and a rank-1 tensor W (the weights), one has to specify the rank-index i along which the contraction operates. In this case we (arbitrarily?) choose the last dimension, which is what one would expect, but I don't see any particular reason why, because we could define it along any other rank-index, too.
Performing that contraction, we iterate over all the other dimensions and compute a scalar product (dot product) between W and the "vector" obtained by fixing all but the last index of A. This results in a new output tensor O of rank m-1 (the i-th rank-index is now scalar, so we don't count it anymore).
Thinking about this in the same way we "unroll" convolution layers into matrix multiplication, you end up with a "flattened" tensor (which looks like a matrix of dimensions (prod(dims of remaining ranks) x dim of i)) that is multiplied with a vector W. My problem with this picture is that the "flattened" tensor is still a matrix (a rank-2 tensor), not something of rank 1, which is what one would expect when hearing "flattened".
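A hedged sketch of that contraction picture (assuming tf.keras and NumPy; the variable names are just for illustration):

```python
# Sketch: Dense on a rank-3 input contracts only the last axis,
# i.e. it behaves like a tensordot over axis -1 with the kernel.
import numpy as np
import tensorflow as tf

x = np.random.rand(2, 5, 3).astype("float32")    # (batch, time, features)

layer = tf.keras.layers.Dense(units=4, use_bias=False)
y = layer(x)                                     # shape (2, 5, 4)

w = layer.kernel.numpy()                         # shape (3, 4)
y_manual = np.tensordot(x, w, axes=[[-1], [0]])  # contract last axis of x with w

print(np.allclose(y.numpy(), y_manual))          # True (up to float tolerance)
```

Here the kernel is contracted only with the last axis; every other axis of x is preserved in the output.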
I just want to chime in that I was also confused by this documentation, as was this StackOverflow user: https://stackoverflow.com/questions/44611006/timedistributeddense-vs-dense-in-keras-same-number-of-parameters
If nothing else, I would really appreciate it if the note explicitly stated that "flatten" here means something different from the Flatten layer. Ideally, the documentation would give an example of an input of some shape and how that is "flattened" to produce the output shape.
I am also confused by this. However, applying a dense layer D[k,l] (of shape (K, L)) to each of the temporal components of an input X[?,m,k] (of shape (?, M, K)) is mathematically identical to the matrix multiplication X * D. This is just a happy coincidence. However, for the TimeDistributed layer to work with an arbitrary layer, Keras needs a "for loop" implementation of this multiplication rather than a fully vectorized one.
If the input were flattened to the shape (?, M*K), the layer would need to have shape (M*K, L) and far more parameters. This does not "correspond, conceptually, to a dot product with a flattened version of the input"; it does, however, correspond conceptually to a dot product with a flattened version where there are in fact M different copies of the dense layer of shape (K, L), so the temporal components do not share the weights. Perhaps that is what they meant by conceptual equivalency.
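A minimal sketch of this comparison (assuming tf.keras and the Keras 2-style input_shape argument; the shapes M, K, L below are made up for illustration):

```python
# Compare parameter counts for the three readings discussed above.
import tensorflow as tf

M, K, L = 5, 3, 4

# Dense applied directly to a rank-3 input: kernel (K, L), shared over M.
dense = tf.keras.Sequential([
    tf.keras.layers.Dense(L, input_shape=(M, K)),
])

# TimeDistributed(Dense): same parameter count, "for loop" over the time axis.
td = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(L), input_shape=(M, K)),
])

# Flatten first: kernel (M*K, L), no weight sharing across time steps.
flat = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(M, K)),
    tf.keras.layers.Dense(L),
])

print(dense.count_params())  # K*L + L   = 16
print(td.count_params())     # K*L + L   = 16
print(flat.count_params())   # M*K*L + L = 64
```

Dense and TimeDistributed(Dense) share one (K, L) kernel across the M time steps, while the flattened variant learns M*K*L weights.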
https://github.com/keras-team/keras/blob/aedad3986200b825d94f847d52bd6b81f0419a06/keras/layers/core.py#L776
The documentation of the dense layer claims to flatten the input if a tensor with rank > 2 is provided. However, what actually happens is that the dense layer picks the last dimension and computes the result element-wise along the remaining axes.
https://github.com/keras-team/keras/blob/aedad3986200b825d94f847d52bd6b81f0419a06/keras/layers/core.py#L858
You can verify that by comparing two models, one adding a Flatten() layer and the other one not: https://gist.github.com/FirefoxMetzger/44e9e056e45c1a3cc8000ab8d6f2cebe
The first model only has 10 + bias = 11 trainable parameters (reusing weights along the 1st input dimension). The second model has 10*10 + bias = 101 trainable parameters. Also, the output shapes are completely different. I would have expected the result to be indifferent wrt. the Flatten() layer...
It might very well be that I am misunderstanding something. If so, kindly point out my mistake =)
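In case it helps, here is a rough, unverified reconstruction of the kind of comparison described above (I am guessing the gist's input shape and unit count from the stated parameter counts):

```python
# Rough reconstruction (assumption: input shape (10, 10) and Dense(1),
# which reproduce the 11 vs. 101 trainable-parameter counts quoted above).
import tensorflow as tf

without_flatten = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(10, 10)),
])
with_flatten = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(10, 10)),
    tf.keras.layers.Dense(1),
])

without_flatten.summary()  # kernel (10, 1) + bias -> 11 params, output (None, 10, 1)
with_flatten.summary()     # kernel (100, 1) + bias -> 101 params, output (None, 1)
```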