locuslab / TCN

Sequence modeling benchmarks and temporal convolutional networks
https://github.com/locuslab/TCN
MIT License

Is there any reason to use TCN over 1D-CNN? #57

Closed: johnsyin closed this issue 3 years ago

johnsyin commented 3 years ago

I find that in the following paper, the authors use a 1D-CNN to replace the causal convolutions.

Wan, R., Mei, S., Wang, J., Liu, M., & Yang, F. (2019). Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting. Electronics, 8(8), 876.

To quote them:

Causal convolution is used to assume that all data must have a one-to-one causal relationship in chronological order. ...

While x1 and x5 may have a direct logical connection, causal convolution will make the relationship between x1 and x5 affected by x2, x3, x4. This design was limited by the absolute order of time-series and inefficient for accurate characteristics learning at a relative time.

Is there some obvious advantage of TCN that I am missing here? As for arbitrary length / flexible receptive field size, I think that can be achieved with a large 1D-CNN or padding. And for
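For example, the receptive-field arithmetic I have in mind is roughly the following (the layer counts, kernel sizes, and dilations are made-up illustrations, not values from the paper):

    # Receptive field of a stack of 1D convolutions (illustrative numbers only).
    def receptive_field(kernel_sizes, dilations):
        # Each layer adds (kernel_size - 1) * dilation extra time steps of context.
        return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

    # A "large" plain 1D-CNN: 4 layers, kernel 3, no dilation -> sees 9 steps.
    print(receptive_field([3, 3, 3, 3], [1, 1, 1, 1]))   # 9

    # A TCN-style stack: same kernels, dilations 1, 2, 4, 8 -> sees 31 steps.
    print(receptive_field([3, 3, 3, 3], [1, 2, 4, 8]))   # 31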

jerrybai1995 commented 3 years ago

In my opinion, the argument you quoted is strange in several respects. Here are my thoughts:

  1. The purpose, and core, of causal convolution is not a "one-to-one causal relationship in chronological order". A causal convolution is just a 1D convolution whose kernel only sees the current and past time steps (i.e., the input is left-padded), so that no information leaks from the future; other than that, there is no difference from a 1D convolution. So I don't see why the authors try to differentiate one from the other. For example, we can simply write a causal convolution as:

    layer = nn.Conv1d(..., ..., kernel_size=3)    # This is a plain 1D conv layer!
    y = layer(F.pad(x, (2, 0)))                   # left-pad by kernel_size - 1 = 2, so nothing leaks from the future

    When future information leakage is not an issue (e.g., in the encoder of a machine translation model), a TCN is simply a canonical 1D convolution. There is no need to treat them as two different things.

  2. With dilated convolutions in the TCN, the quoted claim that "the relationship between x1 and x5 [is] affected by x2, x3, x4" is simply incorrect (and the authors themselves only claimed this for their "Figure 3", which adopts non-dilated, i.e. dilation = 1, convolutions). Specifically, I would say the relationship between x1 and x5 is affected by all paths from x1 to x5. Through dilation, some paths go from x1 to x5 directly (e.g., in the intermediate layers), while other paths could be x1 --> x2 --> x5 (see the sketch after this list).
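To make the path argument concrete, here is a minimal sketch (the channel counts, kernel size, and dilation below are made up for illustration; they are not taken from the paper or from this repo):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy sizes for illustration: batch 1, 1 channel, 5 time steps (x1 ... x5).
    x = torch.randn(1, 1, 5)

    # A causal *dilated* conv: kernel 2, dilation 4, left-padded by (2 - 1) * 4 = 4.
    layer = nn.Conv1d(1, 1, kernel_size=2, dilation=4)
    y = layer(F.pad(x, (4, 0)))    # output length stays 5

    # The output aligned with x5 is w0 * x1 + w1 * x5 + bias:
    # x1 and x5 are connected directly, not "mediated" by x2, x3, x4.
    print(y[0, 0, 4])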

The purpose of gradient-based learning is to make sure that the correct relationship between x1 and x5 is identified and learned. In fact, a similar argument could be made about 2D convolutions and images, yet we all know that CNNs work very well on vision tasks.