Closed · YipingNUS closed this issue 4 years ago
Hello @YipingNUS,
Indeed, your implementation is more straightforward, but as far as I can tell it essentially performs the same computation. I recall that when I wrote this there was some issue with using the Dense layer directly, and the TimeDistributed wrapper was a good alternative at the time.
Hope that answers your question. In any case, please feel free to re-open the ticket if you have any more questions or require further clarification.
Best, Nikos
I guess the reason it previously didn't work without TimeDistributed is that you set the input dimension to V_doc._keras_shape[1]. I'm not sure about the Keras syntax, but tf.shape includes the batch_size as the first dimension, so the input size you specified might have been num_labels instead of the interaction-layer dimension. I tried my code snippet above without specifying the input dimension, and it works.
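To make the shape point concrete, here is a minimal NumPy sketch (the variable names and sizes are made up for illustration, not taken from the repo): when a tensor's shape includes the batch axis, index 1 is the label axis, not the feature dimension a Dense layer acts on.

```python
import numpy as np

# Hypothetical interaction-layer output: (batch_size, num_labels, hidden_dim).
batch_size, num_labels, hidden_dim = 4, 6, 32
V_doc = np.zeros((batch_size, num_labels, hidden_dim))

# Index 0 is the batch axis, so index 1 is num_labels,
# not the interaction-layer (feature) dimension:
print(V_doc.shape[0])   # batch size
print(V_doc.shape[1])   # num_labels, NOT hidden_dim
print(V_doc.shape[-1])  # the feature dimension a Dense layer would act on
```

So an input dimension read from shape index 1 would silently pick up num_labels instead of the feature size.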
The reason adding TimeDistributed works is that it treats each label as a time step. The computation may be the same, but conceptually it's a bit hard to follow. Anyway, thanks for your prompt reply!
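The claimed equivalence can be checked numerically. A minimal NumPy sketch (weights and sizes invented for illustration): applying the same linear unit plus sigmoid to each label "time step" in a loop, which is what TimeDistributed(Dense(1)) conceptually does, matches a single matrix product over the last axis, which is what a plain Dense(1) does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch_size, num_labels, hidden_dim = 4, 6, 32

V_doc = rng.standard_normal((batch_size, num_labels, hidden_dim))
W = rng.standard_normal((hidden_dim, 1))  # shared linear unit
b = rng.standard_normal(1)

# "TimeDistributed" view: apply the same linear unit to each label step.
per_step = np.stack(
    [sigmoid(V_doc[:, t, :] @ W + b) for t in range(num_labels)], axis=1
)

# Plain "Dense" view: one matmul over the last axis.
direct = sigmoid(V_doc @ W + b)

print(np.allclose(per_step, direct))  # True
```

Both produce a (batch_size, num_labels, 1) tensor with identical values, which is why the two formulations give the same result.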
You are right, it is counter-intuitive to use TimeDistributed for this operation; I will update the code with your suggestion.
Thanks again for your feedback!
Hi @nik0spapp, I was studying the code and the paper and I have a question about the output layer. In the paper it corresponds to equations 11-13. In the code, it's the following:
My question is: why is there a TimeDistributed wrapper in the output layer? The paper says the output layer is a linear unit (eq. 11) followed by a sigmoid transformation (eq. 13). My direct translation of the equations would be: