pbloem / former

Simple transformer implementation from scratch in pytorch.
http://peterbloem.nl/blog/transformers
MIT License

token_embedding for non-text sequences #18

Closed StolkArjen closed 3 years ago

StolkArjen commented 3 years ago

Hi Peter,

Thanks for the insightful blog on how to build transformers from scratch. I'm experiencing what's more likely to be a user error than an actual code issue and was hoping you could provide me with a pointer on how to go about it.

In brief, I'm trying to perform sequence classification on multi-feature, non-text sequences. Specifically, each sequence is 5 features by 100 timepoints and has one label. The data points are all integers and include discrete locations in 2D space, cf. positions on a chessboard. The main issue probably resides in the fact that I'm not presenting the data correctly. During the first forward pass over the training data, when generating the token embedding (tokens = self.token_embedding(x) in Transformer), I'm getting:

File "xxx/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

What's unclear to me is whether this issue is due to mismatching tensor sizes or whether my particular dataset is incompatible with the typical use case of nn.Embedding. For completeness, self.token_embedding is 176 by 5, i.e. the number of unique rows/tokens in the dataset (hypothetically, the vocabulary size) by the number of features (the hypothetical embedding size). Any pointers would be much appreciated.
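If it helps, here is a minimal snippet (with made-up values, but the same shapes) that seems to reproduce the error; the raw feature values are apparently being treated as indices into the embedding table:

```python
import torch
import torch.nn as nn

# same shape as my self.token_embedding: 176 "tokens", embedding size 5
token_embedding = nn.Embedding(num_embeddings=176, embedding_dim=5)

# a made-up row of raw feature values; board coordinates can be negative,
# so they are not valid indices into the 0..175 embedding table
x = torch.tensor([[-1, 1, 90, 2, 3]])

tokens = token_embedding(x)  # IndexError: index out of range in self
```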

Best, Arjen

pbloem commented 3 years ago

Hi Arjen,

Indeed, in this case, you'll need to rethink the input to the network. I can see two situations:

- Your five features are really a continuous vector describing each timepoint. In that case an embedding layer isn't the right tool: you project each input vector to the embedding dimension with a linear layer instead.
- Each of your 176 unique rows is really a discrete token, like a word in a vocabulary. In that case you keep nn.Embedding, but you first map each unique row to an integer index in [0, 175] and feed those indices (not the raw feature values) to the embedding layer.

In the last case, there's some structure to the input space that the network has to infer from the data (it's arranged in a 2D grid). You may need to design a more task-specific way of embedding your tokens to explicitly provide this structure to the model. How to do this depends entirely on your task. Still, it's best to start simple. If you're lucky, the network just infers the necessary structure from the data.
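For the second option, a minimal sketch of what the remapping could look like (the sizes and names below are just placeholders):

```python
import torch
import torch.nn as nn

# toy stand-in for the real data: (batch, time, features) with integer feature values
x = torch.randint(-1, 2, (80, 100, 5))

# collect the unique feature rows and map every row to an integer token index
rows = x.reshape(-1, x.size(-1))                     # (batch*time, features)
vocab, token_ids = torch.unique(rows, dim=0, return_inverse=True)
token_ids = token_ids.view(x.size(0), x.size(1))     # (batch, time), values in [0, len(vocab))

emb = 128                                            # model embedding dimension, arbitrary
token_embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=emb)
tokens = token_embedding(token_ids)                  # (batch, time, emb)
```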

StolkArjen commented 3 years ago

Thanks, Peter, this helps to set me on the right track. The second option might not scale well in this particular instance given that the feature space also includes non-spatial information, such as the shape of the game piece.

As for the former, I suppose this would not be a matter of just swapping out nn.Embedding for nn.Linear, seeing as the latter layer speaks of in/out features rather than embedding dimensions? Would one still need to specify the 'embedding size' and, if so, is it proportional to the input size (e.g., the number of features) or some arbitrary number, like the number of neurons/nodes in a hidden layer in the case of LSTMs?

For completeness, a single batch is torch.Size([80, 100, 5]), i.e. batch size × max sequence length × number of input features. And there are 63 possible output classes.

pbloem commented 3 years ago

In the chess example, I would probably start by trying embeddings for all board positions and all game pieces and adding these together. That way you have a pretty small number of tokens (and you see each plenty of times), and you're still giving the model all the information it needs. Experience shows that adding together is sufficient as a way of combining embeddings, so nothing more complex is needed.
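Roughly, a sketch of what I have in mind (the vocabulary sizes below are placeholders for the chess case):

```python
import torch
import torch.nn as nn

emb = 128                                   # model embedding dimension, arbitrary choice
square_embedding = nn.Embedding(64, emb)    # one embedding per board square (8x8)
piece_embedding  = nn.Embedding(13, emb)    # one per piece type, including "empty"

# toy token indices for a batch of sequences: (batch, time)
squares = torch.randint(0, 64, (80, 100))
pieces  = torch.randint(0, 13, (80, 100))

# summing the two embeddings gives one vector per timestep: (batch, time, emb)
tokens = square_embedding(squares) + piece_embedding(pieces)
```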

As for the continuous approach, all layers in the model expect input of (batch, sequence, emb) where emb is the embedding dimension. In your case you can indeed simply feed your (80, 100, 5) tensor through an nn.Linear(5, emb) layer. The linear layer only operates on the rightmost dimension, and treats everything else as batch dimensions, so this does exactly what you want. The final layer of the model should then be nn.Linear(emb, 63) (either followed by a softmax, or with the softmax applied in the loss function).
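Something along these lines, as a rough sketch (I'm using torch's built-in encoder here as a stand-in for the transformer blocks from the blog, and leaving out position embeddings for brevity):

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    def __init__(self, num_features=5, emb=128, num_classes=63, depth=4, heads=8):
        super().__init__()
        self.to_emb = nn.Linear(num_features, emb)       # (b, t, 5) -> (b, t, emb)
        layer = nn.TransformerEncoderLayer(d_model=emb, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_classes = nn.Linear(emb, num_classes)    # (b, emb) -> (b, 63)

    def forward(self, x):
        x = self.to_emb(x.float())    # continuous "embedding" of each timestep
        x = self.blocks(x)            # self-attention over the sequence
        x = x.mean(dim=1)             # average-pool over time
        return self.to_classes(x)     # logits; the softmax lives in the loss function

model = SeqClassifier()
logits = model(torch.randn(80, 100, 5))   # -> (80, 63)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 63, (80,)))
```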

StolkArjen commented 3 years ago

That did the trick for the continuous approach, nice! Fun fact: increasing the learning rate from 0.0001 to 0.005 consistently bumped the performance from 45% to 85%, arguably the upper bound for this type of data. Who knew this small a number could make such a big difference.

As for the embedding approach, I'm afraid I still don't fully grasp your suggestion. In case you don't mind thinking along a bit further (your support has already been very helpful), the chessboard is actually 3 x 3 and there are two game pieces, one belonging to a "sender" player and another to a "receiver", each with a certain orientation (for a more lively idea, see www.MutualUnderstanding.nl/game). I'm inputting it as [x, y, angle, sender shape, receiver shape] per timestamp. The orientation of the receiver shape is encapsulated within the output classes since the receiver target configuration is what I'm trying to predict from the behaviors of the sender's shape. From your blog post, I can see how and why creating embeddings of movie ratings can be useful, ditto for word counts/vocabularies. But why and how would one create embeddings for all board positions separately? On a related note, would this help to gain a better understanding of what the transformer is actually learning/extracting from the data?

pbloem commented 3 years ago

I guess this might be a bit complicated to do with embeddings, the way I described earlier.

After thinking about it a bit, I'd probably describe a single board position as a (3, 3, n) tensor, where the first two dimensions match those of the board, and the third gives you a vector for each square describing what is happening in that square. For that vector, you can use 0s and 1s to encode all the information: use the first two elements to indicate which player occupies the square ([0, 1] for sender, [1, 0] for receiver), then two elements to indicate piece shape, then four to indicate orientation, and so on.
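Concretely, for one square, something like this (the channel layout is just an example):

```python
import torch

n = 8                            # example layout: 2 player + 2 shape + 4 orientation channels
board = torch.zeros(3, 3, n)

# say the sender sits at square (0, 2), with shape 1, oriented at 90 degrees
x, y = 0, 2
board[x, y, 0:2] = torch.tensor([0., 1.])   # [0, 1] = sender, [1, 0] = receiver
board[x, y, 2 + 1] = 1.0                    # shape one-hot
board[x, y, 4 + 1] = 1.0                    # orientation one-hot (0/90/180/270 degrees)
```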

To project a single board position to the model embedding dimension, you can just flatten the tensor and apply an nn.Linear(3*3*n, emb), or you can apply a small convolutional network so that the model has "access" to the grid structure.
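A sketch of both variants (all sizes are placeholders):

```python
import torch
import torch.nn as nn

n, emb = 8, 128
boards = torch.zeros(80, 100, 3, 3, n)      # (batch, time, 3, 3, n)

# variant 1: flatten each board position and project it with a linear layer
to_emb = nn.Linear(3 * 3 * n, emb)
tokens = to_emb(boards.flatten(start_dim=2))                     # (batch, time, emb)

# variant 2: a small convolutional encoder that has access to the grid structure
conv = nn.Sequential(
    nn.Conv2d(n, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, emb, kernel_size=3),        # 3x3 -> 1x1
    nn.Flatten(start_dim=1),                  # -> (batch*time, emb)
)
tokens = conv(boards.view(-1, 3, 3, n).permute(0, 3, 1, 2)).view(80, 100, emb)
```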

Finally, to keep the model entirely transformer based, you could store the whole sequence as a size (time, 3, 3, n) tensor, apply an nn.Linear(n, emb), add position embeddings for the time, x, and y dimensions, and then flatten the whole thing into a (time*3*3, emb) tensor and let the transformer take care of everything.
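In code, that could look something like this (sizes again placeholders):

```python
import torch
import torch.nn as nn

time, emb, n = 100, 128, 8
boards = torch.zeros(1, time, 3, 3, n)         # (batch, time, 3, 3, n)

to_emb = nn.Linear(n, emb)
x = to_emb(boards)                             # (batch, time, 3, 3, emb)

# learned position embeddings for the time, x and y dimensions, broadcast and added
t_pos = nn.Parameter(torch.randn(time, 1, 1, emb))
x_pos = nn.Parameter(torch.randn(1, 3, 1, emb))
y_pos = nn.Parameter(torch.randn(1, 1, 3, emb))
x = x + t_pos + x_pos + y_pos

# flatten time and board squares into one long token sequence for the transformer
x = x.view(x.size(0), time * 3 * 3, emb)       # (batch, time*3*3, emb)
```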

It depends a little on what exactly you're trying to learn, but my money would be on the first option with a small convolutional encoder.

StolkArjen commented 3 years ago

Thanks for thinking this through, Peter, really much appreciated. Just to verify whether my current approach overlaps with the spirit of your suggestion(s), I have it currently encoded as follows. Imagine the "sender" moving from the center of the game board (0,0) to the top left (-1,1), and rotating 90 degrees in place:

t0       = [0, 0, 0, s_shape, r_shape]
t0+100ms = [0, 0, 0, s_shape, r_shape]
t0+200ms = [-1, 0, 0, s_shape, r_shape]
t0+300ms = [-1, 0, 0, s_shape, r_shape]
t0+400ms = [-1, 1, 0, s_shape, r_shape]
t0+500ms = [-1, 1, 0, s_shape, r_shape]
t0+600ms = [-1, 1, 90, s_shape, r_shape]

Although the data is originally logged as timestamps, I've turned it into a timeseries with a 100 ms resolution. This seemed to work better with the LSTM-based approach I used previously, as it allowed creating equal-length (time-normalized) sequences. Perhaps for a transformer-based approach, I could go back to timestamp-based sequences, assuming that's what you're suggesting? For example, removing the time redundancy, the above matrix would become:

t0 = [0, 0, 0, s_shape, r_shape]
t1 = [-1, 0, 0, s_shape, r_shape]
t2 = [-1, 1, 0, s_shape, r_shape]
t3 = [-1, 1, 90, s_shape, r_shape]

You're right that some of these dimensions could be collapsed further still. This is what I did for the receiver goal positions (which I'm trying to decode from the sender movements), converting them into nX × nY × nAngle × nShape classes (for a total of 63). In fact, I could also turn the sender movements into a (timestamp, n) tensor, where n = 1, ..., 189 (63 classes × 3 possible sender/receiver shapes). This would be the most minimal representation of the data, but perhaps this is too abstracted away from the movement space/sequence? Or do you think the transformer doesn't care?
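For example, along these lines (the factor sizes are just my guesses at a layout, not the actual ones):

```python
# hypothetical flattening of (x, y, angle, shape) into a single class index
nX, nY, nAngle, nShape = 3, 3, 4, 3    # placeholder factor sizes

def to_class(x_idx, y_idx, angle_idx, shape_idx):
    # mixed-radix encoding: each combination maps to a unique integer
    return ((x_idx * nY + y_idx) * nAngle + angle_idx) * nShape + shape_idx

print(to_class(2, 1, 3, 0))   # -> 93
```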

Overall, what I'm trying to learn here is what type of dependencies within and between movement sequences an artificial neural net might take advantage of, for overlaying with human performance. Within a sequence, participants of the game come up with solutions like "pausing" to signal a target location, or "wiggling", i.e. stepping out of and back into a square to signal the receiver's target orientation. Across sequences, it gets more complex; that is, I don't know exactly how, but I would be keen to explore this using a neural net. Hence, a read-out of the embedding/fc layers or an attention heat map of some sort would be in the crosshairs, in case you happen to have a suggestion. Finally, this is currently just a hobby/exploration project, but in case it turns into something more I'd happily invite you on board (as a co-author) if that's something you'd also be interested in.

p.s. I forgot to address your remark about CNNs. This is a good point, since CNNs might be better able to provide that desired "read-out"? I was hoping, however, to stay as close as possible to the state of the art, i.e. transformers, in case there'd be an opportunity to translate insights from our communication data to the real world, which is not 3x3 shaped. ;)