It is time to revisit the architecture to ensure there is a clear understanding of what transformations take place with respect to multivariate time-series data, and what analogies one can draw to the Transformer architecture used for sequence modelling of speech.
It was previously thought that by windowing the input sequence, it would be these windows that attend to one another, but this was mistaken. Rather, it is each time step, i.e. the item at each time step, that attends to every other item in the sequence, including itself.
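A quick way to see this in TensorFlow (a toy sketch with illustrative shapes, not the project's code): the self-attention scores form a timesteps-by-timesteps matrix per head, i.e. every time step attends to every other time step, rather than window attending to window.

```python
import tensorflow as tf

BATCH_SIZE, TIMESTEPS, NUM_FEATURES = 4, 100, 6  # toy shapes, illustrative only
x = tf.random.normal((BATCH_SIZE, TIMESTEPS, NUM_FEATURES))

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
out, scores = mha(query=x, value=x, key=x, return_attention_scores=True)

print(out.shape)     # (4, 100, 6)     -- one output vector per time step
print(scores.shape)  # (4, 2, 100, 100) -- each time step attends to all 100 time steps
```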
It was also previously thought that convolving inputs of shape (BATCH_SIZE, timesteps, num_features) would "preserve temporal information". The (mistaken) reasoning behind this came from the following passage in the "Hands-On ML" book:
Keras offers a TimeDistributed layer ... it wraps any layer (e.g., a Dense layer) and applies it at every time step of its input sequence. It does this efficiently, by reshaping the inputs so that each time step is treated as a separate instance (i.e., it reshapes the inputs from [batch size, time steps, input dimensions] to [batch size × time steps, input dimensions];
The Dense layer actually supports sequences as inputs (and even higher-dimensional inputs): it handles them just like TimeDistributed(Dense(...)), meaning it is applied to the last input dimension only (independently across all time steps). Thus, we could replace the last layer with just Dense(10). For the sake of clarity, however, we will keep using TimeDistributed(Dense(10)) because it makes it clear that the Dense layer is applied independently at each time step and that the model will output a sequence, not just a single vector.
Note that a TimeDistributed(Dense(n)) layer is equivalent to a Conv1D(n, filter_size=1) layer.
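As a sanity check of that equivalence, a minimal TensorFlow sketch (toy shapes, illustrative only) shows that Dense, TimeDistributed(Dense), and a kernel-size-1 Conv1D all act independently at every time step and, once the weights are shared, compute the same thing. Note the actual Keras argument is kernel_size rather than filter_size:

```python
import numpy as np
import tensorflow as tf

# Toy shapes, purely for illustration.
BATCH_SIZE, TIMESTEPS, NUM_FEATURES, N_UNITS = 4, 100, 6, 32
x = tf.random.normal((BATCH_SIZE, TIMESTEPS, NUM_FEATURES))

dense = tf.keras.layers.Dense(N_UNITS)
time_dist = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(N_UNITS))
conv1x1 = tf.keras.layers.Conv1D(N_UNITS, kernel_size=1)

# All three apply the same kind of affine map independently at every
# time step, so each output has shape (BATCH_SIZE, TIMESTEPS, N_UNITS).
print(dense(x).shape, time_dist(x).shape, conv1x1(x).shape)

# Copying the Dense weights into the 1x1 convolution (a Conv1D kernel has
# shape (kernel_size, in_features, filters), i.e. the Dense kernel with a
# leading axis of length 1) makes the two layers produce identical outputs.
w, b = dense.get_weights()
conv1x1.set_weights([w[np.newaxis, ...], b])
print(np.allclose(dense(x).numpy(), conv1x1(x).numpy(), atol=1e-5))  # True
```

None of this mixes information across time steps, which is exactly why the ordering has to be injected separately.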
What became apparent is that the convolution being applied at each time step individually has no bearing on temporal information being preserved; in fact, in the "Attention Is All You Need" paper, the authors state:
3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.
Therefore, it is believed that Positional Encoding is required to "bring back" the temporal information that is present in the ordering of the original sequence.
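For reference, a minimal sketch of the sinusoidal encoding, following the formulation in the paper and the linked TensorFlow tutorial (the function name and arguments here are placeholders, not the project's final API):

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encoding from "Attention Is All You Need".

    Returns a (1, max_len, d_model) tensor that can be added to a batch of
    embedded light curves of shape (batch, timesteps, d_model).
    """
    positions = np.arange(max_len)[:, np.newaxis]         # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]              # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                      # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])             # sine on even indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])             # cosine on odd indices
    return tf.cast(angles[np.newaxis, ...], tf.float32)
```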
Whilst the Transformer architecture is used for input sequences of words (sentences), an analogy can be drawn to astrophysical transients: a light curve is a sentence, and the 6-D observation at each time step is equivalent to a word. Considering the EncodingLayer only for now, the encoder takes as input a batch of sentences/light curves represented as sequences of word IDs/6-D observations (the input shape is [batch size, max input sentence length]), and it encodes each word/6-D observation into a 512-dimensional/d-model representation (so the encoder’s output shape is [batch size, max input sentence length, d-model]).
So, for our model, it will take in a full light curve, consisting of N timesteps for each object. It will then apply a convolutional embedding to each timestep to transform the data from [batch size, N-timesteps, 6-D] --> [batch size, N-timesteps, d-model]. From here, a positional encoding will be calculated using trigonometric functions to encode the position of each observation in the sequence. The two are then summed together to produce an input of shape [batch size, max input sentence length (== N-timesteps), d-model]. At this point, the EncodingLayer will process this input through the multi-head self-attention layers as well as other layers.
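A rough sketch of that shape flow, reusing the positional_encoding helper sketched above (the layer choice and sizes are illustrative assumptions, not the final architecture):

```python
import tensorflow as tf

# Illustrative values only; N_TIMESTEPS and D_MODEL are placeholders.
BATCH_SIZE, N_TIMESTEPS, NUM_FEATURES, D_MODEL = 4, 100, 6, 512

light_curves = tf.random.normal((BATCH_SIZE, N_TIMESTEPS, NUM_FEATURES))

# Convolutional embedding: a kernel-size-1 convolution maps each 6-D
# observation to a d_model-dimensional vector, independently per time step.
embed = tf.keras.layers.Conv1D(D_MODEL, kernel_size=1)
embedded = embed(light_curves)                        # (batch, N_TIMESTEPS, D_MODEL)

# Add the sinusoidal positional encoding to inject the ordering of the
# observations before the input reaches the EncodingLayer.
encoder_input = embedded + positional_encoding(N_TIMESTEPS, D_MODEL)
print(encoder_input.shape)                            # (4, 100, 512)
```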
Going forward, the first item required is to implement a PositionalEncoding class. Following this, a refactor of the architecture as a whole will need to be looked at. Furthermore, the PLAsTiCC data preprocessing that created the windowing needs to be revisited; it should be revised to 100 (N x GPs) points, where an input is a whole sequence, i.e. a single light curve for a single object.
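One possible shape for the PositionalEncoding class mentioned above, wrapping the helper sketched earlier as a Keras layer (a sketch only; the constructor arguments are assumptions, not the project's actual API):

```python
class PositionalEncoding(tf.keras.layers.Layer):
    """Adds the fixed sinusoidal encoding to a (batch, timesteps, d_model) input."""

    def __init__(self, max_len, d_model, **kwargs):
        super().__init__(**kwargs)
        # Pre-compute the encoding once; it is fixed, not trainable.
        self.pos_encoding = positional_encoding(max_len, d_model)

    def call(self, inputs):
        # Slice in case the incoming sequence is shorter than max_len,
        # then add the encoding (broadcast over the batch dimension).
        seq_len = tf.shape(inputs)[1]
        return inputs + self.pos_encoding[:, :seq_len, :]
```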
TODO:
- Implement the PositionalEncoding class.
- model.py: drawing examples from the Hands-On ML book and the TensorFlow documentation, https://www.tensorflow.org/tutorials/text/transformer
- Ensure the parquet file is correct, with "appropriate" windowing taking place. Perhaps reduce the number of GPs (investigate).