Closed: lukeb88 closed this issue 5 years ago
Two more questions: 1) the paper also mentions using gradient clipping for language modeling, and I assume this was done as part of the Adam optimizer's configuration, but just to clarify, was this norm scaling? 2) the paper mentions gating as well; is that implemented in the current Keras version?
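For reference, norm-based clipping in Keras is usually just an optimizer argument; a minimal sketch (assuming tf.keras, with a placeholder threshold rather than whatever value the paper used):

```python
# Minimal sketch of gradient clipping by norm with Adam in tf.keras.
# clipnorm rescales any gradient tensor whose L2 norm exceeds the threshold;
# the 0.4 here is a placeholder, not the value used in the paper.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.002, clipnorm=0.4)
model.compile(optimizer=optimizer, loss="mse")
```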
@lingdoc this is in my TODO list. I'll dig into that when I have a bit of time. In ~1 week from now!
btw, great implementation! I'm comparing it with a BiLSTM-CNN variant on a text classification problem and getting similar results with half the runtime.
That's great! Thank you :)
Just for info, I'm working on a PR to make the implementation identical to the paper: https://github.com/philipperemy/keras-tcn/pull/42
Merged. Only weight norm is missing: https://github.com/philipperemy/keras-tcn/commit/141ef1c1fdfc2384aec5b24caa2e9624d486d9ef
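In the meantime, one possible way to get weight normalization (a sketch assuming TensorFlow Addons is installed; this is not part of the repo) is to wrap each convolution:

```python
# Sketch: weight normalization around a dilated causal Conv1D via TensorFlow Addons.
# Illustrative only; filter count, kernel size and dilation are placeholder values.
import tensorflow as tf
import tensorflow_addons as tfa

conv = tfa.layers.WeightNormalization(
    tf.keras.layers.Conv1D(filters=64, kernel_size=2,
                           dilation_rate=4, padding="causal"))

x = tf.random.normal((8, 100, 64))  # (batch, time steps, channels)
y = conv(x)                         # causal padding keeps the time dimension at 100
print(y.shape)                      # (8, 100, 64)
```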
@lukeb88 :

1) "In the paper, they don't talk about stacking various dilated convolutions (parameter nb_stacks): I think I understood what you discussed here, but how does this change compared to just increasing the number of dilations?"
=> Fixed: I set nb_stacks to 1 for all the examples. Results are the same.
As for how stacking differs from just increasing the number of dilations, cf. https://arxiv.org/pdf/1609.03499.pdf
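To make that difference concrete, here is a rough back-of-the-envelope sketch (illustrative only, not code from this repo), assuming two causal convolutions per residual block with kernel size k, so each block adds 2*(k-1)*dilation to the receptive field:

```python
# Rough sketch (not from the repo): receptive field of stacking the same
# dilation list twice vs. using one longer dilation list.
# Assumes 2 causal convs per residual block, each adding (k - 1) * d to the field.
def receptive_field(kernel_size, dilations, nb_stacks=1, convs_per_block=2):
    return 1 + nb_stacks * convs_per_block * (kernel_size - 1) * sum(dilations)

print(receptive_field(2, [1, 2, 4], nb_stacks=2))    # 29: two stacks of [1, 2, 4]
print(receptive_field(2, [1, 2, 4, 8, 16, 32]))      # 127: one longer dilation list
```

In other words, a longer dilation list grows the receptive field exponentially, while extra stacks repeat the same dilation pattern and mainly add depth; the WaveNet paper linked above repeats its dilation cycle in exactly this way.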
2) "In the single residual block (in residual_block()), why is there only one convolutional layer instead of 2 as in the paper? Is it just a simplification?"
=> It was a simplification, but I fixed it!
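For reference, a sketch of a residual block with two dilated causal convolutions in the spirit of the paper (layer choices and names here are illustrative, not the exact keras-tcn code):

```python
# Sketch of a TCN residual block with two dilated causal convolutions,
# loosely following the paper. Activation, dropout rate and the 1x1 shortcut
# condition are illustrative choices, not the exact keras-tcn implementation.
from tensorflow.keras import layers

def residual_block_sketch(x, filters, kernel_size, dilation_rate, dropout_rate=0.1):
    shortcut = x
    for _ in range(2):  # two conv layers per block, as in the paper
        x = layers.Conv1D(filters, kernel_size,
                          dilation_rate=dilation_rate, padding="causal")(x)
        x = layers.Activation("relu")(x)
        x = layers.SpatialDropout1D(dropout_rate)(x)
    if shortcut.shape[-1] != filters:   # match channels before the skip connection
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.add([x, shortcut])
```

Inside a functional model this would be called as, e.g., x = residual_block_sketch(inputs, filters=64, kernel_size=2, dilation_rate=4).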
@philipperemy I supposed that the idea of stacking dilated convolutions was taken from the WaveNet paper! Thanks for your clarification...
great work, btw!!!
@lukeb88 yeah, the idea was taken from WaveNet! Thanks, I highly appreciate it :)
I noticed some differences in the implementation of the architecture compared to the original paper.
I've already come across another issue where @philipperemy answered: "As the author of the paper stated, this TCN architecture really aims to be a simple baseline that you can improve upon. So personally I think replicating the exact same results is not that important."
I just wanted to know if there are any reasons behind these two implementation choices that I don't understand:
1) In the paper, they don't talk about stacking various dilated convolutions (parameter nb_stacks): I think I understood what you discussed here, but how does this change compared to just increasing the number of dilations?
2) In the single residual block (in residual_block()), why is there only one convolutional layer instead of 2 as in the paper? Is it just a simplification?