maxjcohen / transformer

Implementation of Transformer model (originally from Attention is All You Need) applied to Time Series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

the transformer to be applied to classification #18

Closed hongjianyuan closed 4 years ago

hongjianyuan commented 4 years ago

How should I change the Transformer so it can be applied to classification, e.g. seq2seq (many-to-many)? What would I need to change in the last layer of the model?

maxjcohen commented 4 years ago

Hi, I believe the most straightforward solution would be to keep the original architecture and only change the output module. Currently, I have a linear transformation followed by a sigmoid activation; I would start by simply replacing the activation with a softmax and see from there.
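
For illustration, a minimal sketch of such a classification head in PyTorch; the class name and layer layout are assumptions, not the repo's actual output module:

```python
import torch
import torch.nn as nn

# Hypothetical output module: a linear projection followed by a softmax
# over the class dimension (instead of the sigmoid used for regression).
class ClassificationHead(nn.Module):
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len, num_classes)
        return torch.softmax(self.linear(x), dim=-1)
```

Note that if you train with `nn.CrossEntropyLoss`, you would return the raw logits instead, since that loss applies log-softmax internally.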

hongjianyuan commented 4 years ago

I currently want to input 250 features, segment them, and output the category of each of these 250 features. So do I just need to change the output module to a softmax?

maxjcohen commented 4 years ago

Yes, set d_input=250 and d_output to the number of classes, and replace the sigmoid with a softmax; you should then have a functional segmentation algorithm.
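
As a rough usage sketch showing the shapes involved (the encoder below is only a stand-in for the repo's Transformer body, whose constructor takes more hyperparameters than shown here):

```python
import torch
import torch.nn as nn

d_input, d_model, num_classes = 250, 64, 4
batch, seq_len = 8, 100

# Placeholder for the repo's Transformer: maps
# (batch, seq_len, d_input) to (batch, seq_len, d_model).
encoder = nn.Sequential(nn.Linear(d_input, d_model), nn.ReLU())
head = nn.Linear(d_model, num_classes)

x = torch.randn(batch, seq_len, d_input)
logits = head(encoder(x))                # (batch, seq_len, num_classes)
probs = torch.softmax(logits, dim=-1)    # class probabilities per time step
print(probs.shape)                       # torch.Size([8, 100, 4])
```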

hongjianyuan commented 4 years ago

Thank you very much

hongjianyuan commented 4 years ago

> Yes, set d_input=250 and d_output to the number of classes, and replace the sigmoid with a softmax; you should have a functional segmentation algorithm.

If the output is the category of each of these 250 features, would the output shape be something like 250*4?

MJimitater commented 3 years ago

Hi @maxjcohen , thanks for your great repo!

Is it possible to change the Transformer to perform sequence classification (many-to-one)?

maxjcohen commented 3 years ago

Hi, nothing is stopping you from setting d_output = 1 so that the Transformer behaves as a many-to-one model. In practice, every hidden state will be computed with dimension d_model and later aggregated in the last layer to output a single value. Note that this process is different from how traditional architectures, such as RNN-based networks, handle many-to-one predictions.
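
One possible sketch of that aggregation step (mean-pooling the per-time-step representations and projecting to a single value); this illustrates the idea rather than the repo's exact last layer:

```python
import torch
import torch.nn as nn

# Illustrative many-to-one head: pool over the time dimension, then map
# the pooled representation to one output value per sequence.
class ManyToOneHead(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        pooled = h.mean(dim=1)        # (batch, d_model)
        return self.linear(pooled)    # (batch, 1), one value per sequence
```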

MJimitater commented 3 years ago

Thank you for your reply @maxjcohen! How exactly is it different? Different from the way an RNN model feeds its hidden state back in as further input?

maxjcohen commented 3 years ago

RNNs carry a memory-like hidden state across time steps, while the Transformer has no notion of memory and computes all time steps in parallel instead.
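
A small illustration of that difference, using a single attention head with no projections for brevity:

```python
import torch
import torch.nn as nn

batch, seq_len, d = 2, 5, 8
x = torch.randn(batch, seq_len, d)

# RNN: a hidden state is carried from one time step to the next,
# so the computation is inherently sequential.
rnn_cell = nn.GRUCell(d, d)
h = torch.zeros(batch, d)
for t in range(seq_len):
    h = rnn_cell(x[:, t, :], h)   # h depends on all previous steps

# Self-attention: every time step attends to every other one in a single
# batched matrix product, with no hidden state carried across steps.
q = k = v = x                                   # no projections, for illustration
scores = q @ k.transpose(1, 2) / d ** 0.5       # (batch, seq_len, seq_len)
out = torch.softmax(scores, dim=-1) @ v         # all time steps computed in parallel
```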