Closed: lukeb88 closed this issue 5 years ago
Two more questions: 1) the paper also mentions using gradient clipping for language modeling, and I assume this was done as part of the Adam optimizer's configuration, but just to clarify, was this norm scaling? 2) the paper mentions gating as well; is that implemented in the current Keras version?
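For reference, norm-based clipping in Keras is usually just an optimizer argument; a minimal sketch (assuming tf.keras, with a placeholder threshold rather than whatever value the paper used):

```python
# Minimal sketch of gradient clipping by norm with Adam in tf.keras.
# clipnorm rescales any gradient tensor whose L2 norm exceeds the threshold;
# the 0.4 here is a placeholder, not the value used in the paper.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.002, clipnorm=0.4)
model.compile(optimizer=optimizer, loss="mse")
```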
@lingdoc this is in my TODO list. I'll dig into that when I have a bit of time. In ~1 week from now!
btw, great implementation! I'm comparing it with a BiLSTM-CNN variant on a text classification problem and getting similar results with half the runtime.
That's great! Thank you :)
Just for info, I'm working on a PR to make the implementation identical to the paper: https://github.com/philipperemy/keras-tcn/pull/42
Merged. Only weight norm is missing: https://github.com/philipperemy/keras-tcn/commit/141ef1c1fdfc2384aec5b24caa2e9624d486d9ef
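In the meantime, one possible way to get weight normalization (a sketch assuming TensorFlow Addons is installed; this is not part of the repo) is to wrap each convolution:

```python
# Sketch: weight normalization around a dilated causal Conv1D via TensorFlow Addons.
# Illustrative only; filter count, kernel size and dilation are placeholder values.
import tensorflow as tf
import tensorflow_addons as tfa

conv = tfa.layers.WeightNormalization(
    tf.keras.layers.Conv1D(filters=64, kernel_size=2,
                           dilation_rate=4, padding="causal"))

x = tf.random.normal((8, 100, 64))  # (batch, time steps, channels)
y = conv(x)                         # causal padding keeps the time dimension at 100
print(y.shape)                      # (8, 100, 64)
```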
@lukeb88 :

1) "In the paper, they don't talk about stacking various dilated convolutions (parameter nb_stacks): I think I understood what you discussed here, but how does this change compared to just increasing the number of dilations?"
=> Fixed: I set nb_stacks to 1 for all the examples. Results are the same.
As for how stacking differs from just increasing the number of dilations, cf. https://arxiv.org/pdf/1609.03499.pdf
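To make that difference concrete, here is a rough back-of-the-envelope sketch (illustrative only, not code from this repo), assuming two causal convolutions per residual block with kernel size k, so each block adds 2*(k-1)*dilation to the receptive field:

```python
# Rough sketch (not from the repo): receptive field of stacking the same
# dilation list twice vs. using one longer dilation list.
# Assumes 2 causal convs per residual block, each adding (k - 1) * d to the field.
def receptive_field(kernel_size, dilations, nb_stacks=1, convs_per_block=2):
    return 1 + nb_stacks * convs_per_block * (kernel_size - 1) * sum(dilations)

print(receptive_field(2, [1, 2, 4], nb_stacks=2))    # 29: two stacks of [1, 2, 4]
print(receptive_field(2, [1, 2, 4, 8, 16, 32]))      # 127: one longer dilation list
```

In other words, a longer dilation list grows the receptive field exponentially, while extra stacks repeat the same dilation pattern and mainly add depth; the WaveNet paper linked above repeats its dilation cycle in exactly this way.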
2) "In the single residual block (in residual_block()), why is there only one convolutional layer instead of 2 as in the paper? Is it just a simplification?"
=> It was a simplification, but I fixed it!
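For reference, a sketch of a residual block with two dilated causal convolutions in the spirit of the paper (layer choices and names here are illustrative, not the exact keras-tcn code):

```python
# Sketch of a TCN residual block with two dilated causal convolutions,
# loosely following the paper. Activation, dropout rate and the 1x1 shortcut
# condition are illustrative choices, not the exact keras-tcn implementation.
from tensorflow.keras import layers

def residual_block_sketch(x, filters, kernel_size, dilation_rate, dropout_rate=0.1):
    shortcut = x
    for _ in range(2):  # two conv layers per block, as in the paper
        x = layers.Conv1D(filters, kernel_size,
                          dilation_rate=dilation_rate, padding="causal")(x)
        x = layers.Activation("relu")(x)
        x = layers.SpatialDropout1D(dropout_rate)(x)
    if shortcut.shape[-1] != filters:   # match channels before the skip connection
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.add([x, shortcut])
```

Inside a functional model this would be called as, e.g., x = residual_block_sketch(inputs, filters=64, kernel_size=2, dilation_rate=4).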
@philipperemy I supposed that the idea of stacking dilated convolutions was taken from the WaveNet paper! Thanks for your clarification...
great work, btw!!!
@lukeb88 yeah, the idea was taken from WaveNet! Thanks, I highly appreciate it :)
I noticed some differences in the implementation of the architecture compared to the original paper.
I've already come across another issue where @philipperemy answered: "As the author of the paper stated, this TCN architecture really aims to be a simple baseline that you can improve upon. So personally I think replicating the exact same results is not that important."
I just wanted to know if there are any reasons behind these two implementation choices that I don't understand:
1) In the paper, they don't talk about stacking various dilated convolutions (parameter nb_stacks): I think I understood what you discussed here, but how does this change compared to just increasing the number of dilations?
2) In the single residual block (in residual_block()), why is there only one convolutional layer instead of 2 as in the paper? Is it just a simplification?