When I use 28 sequence length for LSTM and TCN, LSTM is much faster than TCN.

locuslab / TCN

Sequence modeling benchmarks and temporal convolutional networks

https://github.com/locuslab/TCN

MIT License


When I use 28 sequence length for LSTM and TCN, LSTM is much faster than TCN. #39

Closed: KinWaiCheuk closed this issue 4 years ago

KinWaiCheuk commented 4 years ago

It seems to me that LSTM is faster when the sequence length is short (say 28). When the sequence length is long (say 784), LSTM will be much slower than TCN.

It seems to me for TCN, the computation time is independent of the sequence length.

Am I correct?

jerrybai1995 commented 4 years ago

Not necessarily. LSTM is slower than TCN on long sequences because recurrent networks process tokens sequentially, whereas a TCN can perform its convolutions in parallel. However, when the sequence is long enough, you should still expect a slowdown because you only have a limited number of CUDA cores (or limited CPU compute).
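To check this on your own hardware, a rough timing sketch follows. It is not the repository's model: `TinyCausalConvStack` is a hypothetical stand-in built from dilated causal `nn.Conv1d` layers, and the sizes (32 channels, 4 levels, kernel size 3, batch 32) are arbitrary assumptions. It runs on CPU; on a GPU you would also need `torch.cuda.synchronize()` before reading the clock.

```python
import time

import torch
import torch.nn as nn


class TinyCausalConvStack(nn.Module):
    """Hypothetical stand-in for a TCN: a stack of dilated causal 1-D convolutions."""

    def __init__(self, channels=32, kernel_size=3, levels=4):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(levels):
            dilation = 2 ** i
            left_pad = (kernel_size - 1) * dilation  # pad only the past side
            layers += [
                nn.ConstantPad1d((left_pad, 0), 0.0),
                nn.Conv1d(in_ch, channels, kernel_size, dilation=dilation),
                nn.ReLU(),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, 1, seq_len)
        return self.net(x)


def time_forward(fn, repeats=3):
    """Average wall-clock seconds per call, after one warm-up call."""
    with torch.no_grad():
        fn()
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
    return (time.perf_counter() - start) / repeats


def compare(seq_len, batch=32):
    """Time one forward pass of an LSTM and of the conv stack on the same input."""
    lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True).eval()
    conv = TinyCausalConvStack().eval()
    x = torch.randn(batch, seq_len, 1)  # (batch, seq_len, features)
    t_lstm = time_forward(lambda: lstm(x))
    t_conv = time_forward(lambda: conv(x.transpose(1, 2)))
    return t_lstm, t_conv


for seq_len in (28, 784):
    t_lstm, t_conv = compare(seq_len)
    print(f"seq_len={seq_len}: LSTM {t_lstm:.4f}s, conv stack {t_conv:.4f}s")
```

The numbers depend heavily on hardware, batch size and model size, so treat them as a local measurement rather than a general result.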

KinWaiCheuk commented 4 years ago

But when I run sequential MNIST in 28X28 fashion (each sequence has a length of 28, with 28 sequences in total), LSTM is much faster than TCN.

Here's my training for LSTM, which takes only 13 seconds per epoch.

Here's my training for TCN, which takes almost 40 seconds per epoch.

Am I doing anything wrong here?

jerrybai1995 commented 4 years ago

Nope, I think it depends more on your dilation configuration, batch size, # of parameters, # of LSTM layers and the compute resource you use than merely the architectural differences. However, you should expect good parallelism from TCN, which offers great advantages as the seq length gets longer.
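One way to see why the dilation configuration matters is to compute the receptive field: it grows quickly with depth, so a TCN sized for length-784 inputs does far more work than a length-28 input needs. The sketch below assumes each residual block holds two causal convolutions with the same kernel size and a dilation of `2**i` at level `i`. That block layout is an assumption, not something stated in this thread, and the function names are hypothetical.

```python
def tcn_receptive_field(kernel_size: int, levels: int, convs_per_block: int = 2) -> int:
    """Number of time steps (including the current one) seen by the last output."""
    field = 1
    for level in range(levels):
        dilation = 2 ** level
        # Each dilated causal convolution extends the field by (k - 1) * dilation.
        field += convs_per_block * (kernel_size - 1) * dilation
    return field


def min_levels_to_cover(seq_len: int, kernel_size: int) -> int:
    """Smallest number of levels whose receptive field covers seq_len steps."""
    levels = 1
    while tcn_receptive_field(kernel_size, levels) < seq_len:
        levels += 1
    return levels


for seq_len in (28, 784):
    levels = min_levels_to_cover(seq_len, kernel_size=7)
    print(seq_len, levels, tcn_receptive_field(7, levels))
```

Under these assumptions and a kernel size of 7, two levels already cover 28 steps, while 784 steps need seven, so reusing a deep configuration on short sequences adds cost without adding context.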

KinWaiCheuk commented 4 years ago

I see your point. When the seq length is longer, I do see the advantage of TCN being able to parallelize.

One last question: in your paper, did you compare TCN on 784X1 sequential MNIST to LSTM on 28X28 sequential MNIST? My LSTM performs very poorly when trained on 784X1 sequential MNIST. It basically doesn't learn; its accuracy is only around 0.12.
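For reference, the two input layouts discussed here can be sketched as follows. The batch is random dummy data rather than real MNIST, and the variable names are hypothetical. The first two layouts assume an `nn.LSTM` created with `batch_first=True`; the last is the channel-first layout that `nn.Conv1d` expects.

```python
import torch

images = torch.rand(8, 1, 28, 28)  # dummy batch shaped like MNIST images

# 28X28: each image row is one time step with 28 features.
rows = images.view(-1, 28, 28)  # (batch, seq_len=28, features=28)

# 784X1: each pixel is one time step with a single feature.
pixels = images.view(-1, 784, 1)  # (batch, seq_len=784, features=1)

# Conv1d-based models take (batch, channels, seq_len) instead.
pixels_channel_first = images.view(-1, 1, 784)

print(rows.shape, pixels.shape, pixels_channel_first.shape)
```

In the 784X1 layout the recurrent network has to carry information across 784 steps instead of 28, which is why it is the harder task for an LSTM.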

jerrybai1995 commented 4 years ago

Try to tune the forget gate bias. I think it should reach an accuracy of about 90%.
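A minimal sketch of setting the forget gate bias in PyTorch, assuming an `nn.LSTM` module. PyTorch stores two bias vectors per layer (`bias_ih_l*` and `bias_hh_l*`), each ordered as input, forget, cell, output gates, and adds them together, so the sketch puts the value in one and zeros the other. The helper name `init_forget_gate_bias`, the hidden size, and the starting value of 1.0 are assumptions to tune, not settings taken from the paper.

```python
import torch
import torch.nn as nn


def init_forget_gate_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Set the effective forget gate bias of every layer to `value`."""
    h = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if name.startswith("bias_ih"):
                # Gate order is (input, forget, cell, output); forget is the 2nd chunk.
                param[h:2 * h].fill_(value)
            elif name.startswith("bias_hh"):
                # The two biases are summed, so zero this one to avoid doubling.
                param[h:2 * h].zero_()


lstm = nn.LSTM(input_size=1, hidden_size=128, batch_first=True)
init_forget_gate_bias(lstm, value=1.0)
```

A larger forget bias keeps the cell state from decaying early in training, which tends to matter more as sequences get longer.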

KinWaiCheuk commented 4 years ago

I have been trying to replicate the results that you reported in your paper, but my LSTM results are always worse than what you reported.

After initializing the forget gate bias to 1, I did get a better result. But I am still unable to reach 90% accuracy in 20 epochs. I have added gradient clipping at 1 and use RMSprop as the optimizer, but I can only get at most 85% accuracy.

Here is my code. Am I missing anything?

jerrybai1995 commented 4 years ago

No, I think you are doing things correctly. Don't use RMSprop; I think Adam would work just fine. You can try to tune gradient clipping as well.
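A sketch of one training step with Adam and gradient-norm clipping, under stated assumptions: `SeqClassifier` and `train_step` are hypothetical names, the hidden size, learning rate of 1e-3 and clip value of 1.0 are placeholders to tune, and the batch is random dummy data in the 784X1 layout rather than real MNIST.

```python
import torch
import torch.nn as nn


class SeqClassifier(nn.Module):
    """Hypothetical LSTM classifier that reads one pixel per time step."""

    def __init__(self, input_size=1, hidden_size=64, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):  # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # classify from the last time step


def train_step(model, optimizer, x, y, clip=1.0):
    """Run one optimizer step; return the loss and the pre-clipping gradient norm."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Rescale gradients so their total L2 norm is at most `clip`.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    return loss.item(), float(total_norm)


model = SeqClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 784, 1)        # dummy batch, 784X1 layout
y = torch.randint(0, 10, (4,))    # dummy labels
loss, grad_norm = train_step(model, optimizer, x, y, clip=1.0)
print(f"loss={loss:.4f}, grad norm before clipping={grad_norm:.4f}")
```

Logging the pre-clipping norm is a simple way to judge whether the clip value is too tight or never triggers.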