gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Sparse, Binary Data: Interpolate Missing #16

Open ianhill60 opened 2 years ago

ianhill60 commented 2 years ago

I am running into an issue related to the type of data I am using. I built a new data class that preprocesses data into the same dataframe format and indexing as the provided examples (appending repeated sample sequences to a dataframe indexed by sample number, with each row as a timestep and each column as a feature). However, the data I am leveraging is extremely sparse and binary: many NaNs and few 1s. I noticed that your data.py has a function called interpolate_missing, which I am running on my sparse dataframe. However, it is replacing the NaNs with ones, effectively creating a univariate dataframe. I'm happy to write my own function that simply replaces my NaNs with 0s, but I am worried that binary data might not work well with this model type. Could you please provide intuition or guidance here?
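(For illustration, a minimal sketch of such a zero-fill, assuming a pandas DataFrame in the format described above; the function name is hypothetical:)

import numpy as np
import pandas as pd

def replace_missing_with_zeros(df: pd.DataFrame) -> pd.DataFrame:
    # Replace NaNs with 0s rather than interpolating, preserving the
    # sparse binary structure (many 0s, few 1s).
    return df.fillna(0).astype(np.float32)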

Also, I am running this as a supervised regression task to predict the (discretized) end-time of the sequence. My current strategy is to simply provide a label_df with the numerical discretized end-time for each sample, but I know there are other ways to label for this task. Any intuition on whether this strategy will be effective, or whether I should try something else?

Thanks,

Ian

ianhill60 commented 2 years ago

I'm also curious whether anyone has implemented slider masking. Does masking occur on the test set? I'd like to control the masking of the test samples so that only the very ends (up to a specific timestep) are masked, to simulate real-time prediction during sample collection. Any advice/guidance is appreciated! Thanks! Ian

gzerveas commented 2 years ago

Hi Ian,

First of all, I have never considered a binary dataset for this work, but who knows, it might work :) Can you please explain what the meaning of NaN is in your dataset? Are they unknown binary values that were simply mis-sampled? In that case, maybe you could try replacing them with 0.5, or sampling between 0 and 1 with 50% probability. I also think it is important to exclude these artificially imputed values from the prediction objective of the model (i.e., they shouldn't be part of the loss) by excluding them from the noise/prediction mask. In general, I expect the model to learn that it should provide predictions close to 0 or 1, and not in between. But only experiments would tell!
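(A minimal, untested sketch of this suggestion, with hypothetical names; the imputed positions are returned so they can be excluded from the noise/prediction mask and hence from the loss:)

import numpy as np
import pandas as pd

def impute_binary(df: pd.DataFrame, seed: int = 0):
    # Replace NaNs with random 0/1 draws (50% probability each) and record
    # which positions were artificially imputed, so they can later be
    # excluded from the noise/prediction mask and thus from the loss.
    rng = np.random.default_rng(seed)
    imputed = df.isna().to_numpy()  # True where a value was artificially filled
    filled = df.to_numpy(dtype=float)
    filled[imputed] = rng.integers(0, 2, size=int(imputed.sum()))
    return pd.DataFrame(filled, index=df.index, columns=df.columns), imputed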

I am not sure I understand correctly, but do you want to do some kind of forecasting with binary values? In that case, I would use a window that hides the last n time steps of the multivariate time series, as an auto-regression task (n could also be 1); see Fig. 4 of https://arxiv.org/pdf/2010.02803.pdf.

Masking does not usually occur on the test set (e.g. for classification or global/extrinsic regression), unless the actual objective is filling missing values, i.e. imputation. In your case, if you are interested in (something like) forecasting, yes, you would need the mask also on the test set.

ianhill60 commented 2 years ago

Thanks for the help George!

I think I can say with decent confidence that your model works for binary data. My data preprocessing had been representing 0s as NaNs; there are no true NaNs in my dataset, so I just made them all 0s. I went with predicting the length of a sequence (supplied as a numerical label) from the binary input, as a regression task, as opposed to approaching it as an imputation task and inferring the length from the imputed output. Even with very limited data, the validation scores converged, and the MAE was about 1/12th of the total sequence length: not a great result, but it does work and should improve with more data. In terms of forecasting, have you implemented synchronous masking as shown in Fig. 4? I was able to achieve it manually by cutting off the ends of my test samples, but masking would be more efficient.

Best,

Ian

gzerveas commented 1 year ago

Hi Ian, sorry for the long delay in responding, I am caught up in many different things. Thank you for sharing the very interesting observations with respect to binary data!

Synchronous masking is indeed implemented (and selected by setting --mask_mode concurrent), but it will randomly mask time steps (based on either a Bernoulli or a geometric distribution). Specifically for forecasting, you could have a look at the transduct_mask function as a reference and implement it very easily. This masking function could be called within a ForecastingDataset, which would be almost identical to the TransductionDataset. Here is a quick, untested solution that comes to mind:

import numpy as np

def forecasting_mask(X, masked_steps=1):
    """
    Creates a boolean mask of the same shape as X, with 0s at places where a feature should be masked.
    Args:
        X: (seq_length, feat_dim) numpy array of features corresponding to a single sample
        masked_steps: number of steps (or proportion of the time series length, if in (0, 1)) at the end of the time series which will be masked
    Returns:
        boolean numpy array with the same shape as X, with 0s at places where features should be masked
    """

    mask = np.ones(X.shape, dtype=bool)
    if 0 < masked_steps < 1:  # interpret as a proportion of the sequence length
        masked_steps = max(1, int(np.round(masked_steps * X.shape[0])))
    mask[-masked_steps:, :] = 0  # hide the last masked_steps time steps

    return mask
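
(A hypothetical usage example for the sketch above, masking the final 10% of a sequence:)

X = np.random.rand(120, 7)                    # (seq_length, feat_dim) for one sample
mask = forecasting_mask(X, masked_steps=0.1)  # masks the last 12 of 120 steps
assert not mask[-12:].any() and mask[:-12].all()
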
emrul commented 1 year ago

@ianhill60 @gzerveas - just dropping in to say thanks for this issue. I have a similar dataset (sparse binary) and the (early) results are promising. I am doing this as a classification task and it was a bit tricky to get the data-prep done right.

I want to ask - have you any ideas on how I could modify the model to take a target label at each timestep instead of for each sample?

... and that is what I've found transduction is for! Nevermind :-)

gzerveas commented 1 year ago

> @ianhill60 @gzerveas - just dropping in to say thanks for this issue. I have a similar dataset (sparse binary) and the (early) results are promising. I am doing this as a classification task and it was a bit tricky to get the data-prep done right.
>
> ~I want to ask - have you any ideas on how I could modify the model to take a target label at each timestep instead of for each sample?~
>
> ... and that is what I've found transduction is for! Nevermind :-)

Hey, great that you discovered transduction, and sorry for not getting back to you earlier. Yes, you can definitely use the TransductionDataset class for your purpose, together with the TSTransformerEncoder class. You simply have to initialize it properly (i.e., define the input and target variables in main.py). This will give you a prediction for each time step, and assuming you are doing binary classification, that would be all you need (because the output labels would be yet another binary variable). I would use binary cross-entropy as the loss, though, instead of the MSE used for the Transduction and Imputation tasks.
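(A rough sketch of that loss swap, with hypothetical names rather than the repository's API; the model's per-step outputs are assumed to be raw logits, and a boolean mask selects which steps count toward the loss:)

import torch
import torch.nn.functional as F

def per_step_bce_loss(logits, targets, target_mask):
    # logits, targets: (batch, seq_length, 1); targets are float 0/1 labels.
    # target_mask: same shape, True where a step should contribute to the
    # loss (padding or artificially imputed steps can be excluded here).
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return loss[target_mask].mean()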

However, if you have a multi-class classification use case, then you should modify the dimensionality of the weight matrix of the output layer so that it outputs as many dimensions as you have classes. And of course, the loss should be a cross-entropy loss over all time steps. In this case, I would actually use the TSTransformerClassiregressor class to implement it, as it already offers the proper output layer, but remove lines 309-310, because we don't want to concatenate all output embeddings into one; we want to use each one separately as an input to the output layer.
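(A sketch of such a per-step head, as a hypothetical module rather than the repository's class; a single linear layer is shared across time steps instead of concatenating all step embeddings into one vector:)

import torch.nn as nn

class PerStepClassificationHead(nn.Module):
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        # One shared projection from each step's embedding to class logits.
        self.output_layer = nn.Linear(d_model, num_classes)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_length, d_model)
        # returns (batch, seq_length, num_classes), to be fed to a
        # cross-entropy loss computed over all (unpadded) time steps
        return self.output_layer(encoder_output)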

emrul commented 1 year ago

Thank you @gzerveas - really nicely written code overall and very easy to adapt.