gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License

Shape Issue When Masking a Single Timestep's Feature Values #38

Open rishanb opened 1 year ago

rishanb commented 1 year ago

I'm attempting a classification on custom data. There are 8 features and 447 time steps or samples in the train/val set. I'm guessing the issue is with my dataset, so I provide some shape prints below.

The issue occurs in dataset.py, around line 263 (I added a lot of comments, so the line number might be slightly off), at:

for m in range(X.shape[1]):  # feature dimension

which throws IndexError: tuple index out of range.

Printing some variables just before the problematic line shows that X is a one-dimensional array whose size equals the number of features:

if distribution == 'geometric':  # stateful (Markov chain)
    if mode == 'separate':  # each variable (feature) is independent
        mask = np.ones(X.shape, dtype=bool)
        print(f'type X: {type(X)}')
        print(f'X.shape: {X.shape}')
        print(f'X.shape[1]: {X.shape[1]}')  # this line raises the IndexError

Gives:

type X: <class 'numpy.ndarray'>
X.shape: (8,)
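For reference, here is a minimal standalone reproduction of the failure (not from the repo, just the shape behavior in isolation): indexing shape[1] on a 1-D array fails because the shape tuple has only one entry.

```python
import numpy as np

X = np.zeros(8)          # 1-D array, like the one in my debug output
print(X.shape)           # (8,) -- a one-element shape tuple

try:
    X.shape[1]           # there is no second dimension to index
except IndexError as e:
    print(e)             # tuple index out of range
```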

Further up the call chain, around line 35, is where noise_mask is called. I printed some variables there to debug too:

X = self.feature_df.loc[self.IDs[ind]].values  # (seq_length, feat_dim) array
print(f'\nshape X: {X.shape}')
print(f'X: {X}')
print(f'self.feature_df: {self.feature_df.shape}')
print(f'self.IDs[ind]: {self.IDs[ind]}')
print(f'\nBuilding mask')
mask = noise_mask(X, self.masking_ratio, self.mean_mask_length, self.mode, self.distribution,
                  self.exclude_feats)  # (seq_length, feat_dim) boolean array

Gives:

shape X: (8,)
X: [ 0.62708933  0.75219719 -0.65542292 -0.25243002 -1.11766093 -1.75127136
 -0.79571237 -0.17200066]
self.feature_df: (447, 8)
self.IDs[ind]: 2022-05-27T00:00:00.000000000
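I think this output also hints at why X comes out 1-D: when every ID labels exactly one row, pandas' df.loc[scalar_id] returns a 1-D Series rather than a 2-D sub-DataFrame. A small sketch with made-up data (not from the repo) showing the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where each timestamp ID labels exactly one row,
# mirroring my (447, 8) feature_df
df = pd.DataFrame(np.random.randn(3, 8),
                  index=pd.to_datetime(["2022-05-27", "2022-05-28", "2022-05-29"]))

print(df.loc[df.index[0]].values.shape)    # (8,)   -- scalar label, unique index: Series
print(df.loc[[df.index[0]]].values.shape)  # (1, 8) -- list of labels: DataFrame

# When an ID labels multiple rows, a scalar .loc already yields a 2-D frame:
df2 = pd.DataFrame(np.random.randn(6, 8), index=[0, 0, 0, 1, 1, 1])
print(df2.loc[0].values.shape)             # (3, 8)
```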

Edit: The command I'm using to run is:

python src/main.py --output_dir experiments --comment "pretraining through imputation" --name pretrained_ex1 --records_file Imputation_records_ex1.xls --data_dir data_preprocessing/1 --data_class csv --pattern TRAIN --val_ratio 0.2 --epochs 2 --lr 0.001 --optimizer RAdam  --pos_encoding learnable --num_layers 3  --num_heads 16 --d_model 128 --dim_feedforward 512 --batch_size 128
rishanb commented 1 year ago

After looking through the code more, I suspect the shape of my source data might need to be altered. Right now self.all_df has shape (number of time steps, number of features + labels) and self.feature_df has shape (number of time steps == self.all_IDs, number of features). Should this instead be (self.all_IDs = 1, number of time steps, number of features)? That doesn't quite make sense to me when I look at example_data_class.py, but it seems to be what ImputationDataset.__getitem__ is expecting?

gzerveas commented 1 year ago

Hi, the expected shape and format of the dataframe is currently (num_IDs * time_steps_per_ID, num_features_and_labels_per_step). That is, each row in the dataframe corresponds to a single timestep, but multiple rows are indexed by the same sample ID. This means that when you call self.all_df.loc[my_id] you will get all rows/timesteps for this ID (as a sub-dataframe or series), and thus when you call self.feature_df.loc[my_ID].values you will get a (seq_length, feat_dim) array, which is exactly what happens in the ImputationDataset here. This is because, in the general case, the code needs to accommodate things like: sample 31 = a 2-second-wide measurement window of 7 sensors (features), with a sampling rate (in the signal-processing sense) of 100 samples/sec, corresponding to the activity "fishing" (label); sample 92 = a different (e.g. captured on a different day) 1-second measurement window corresponding to the activity "resting"; etc. These two samples (in the machine-learning sense, with IDs 31 and 92) would occupy 200 + 100 = 300 rows, each with 7 features (+ 1 label). Does this clarify things?
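To put the example above into code (toy random data, same numbers: samples 31 and 92, 7 features, 200 and 100 timesteps respectively):

```python
import numpy as np
import pandas as pd

n_feat = 7
# Sample 31: 200 timesteps (2 s at 100 samples/sec);
# sample 92: 100 timesteps (1 s at 100 samples/sec)
ids = [31] * 200 + [92] * 100
feature_df = pd.DataFrame(np.random.randn(300, n_feat), index=ids)

print(feature_df.shape)                 # (300, 7) -- one row per timestep
print(feature_df.loc[31].values.shape)  # (200, 7) -- one (seq_length, feat_dim) sample
print(feature_df.loc[92].values.shape)  # (100, 7)
```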

I suspect that in your case you are always considering 1 "sample" = 1 time step (where "sample" is used in the signal-processing sense, not the machine-learning sense). In that case, feel free to use "pseudo-IDs", e.g. one per row, by using the simple (ordinal) row index. It really depends on what you are trying to do (i.e. what your data and labels are).
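As an aside, if the goal is to carve proper (seq_length, feat_dim) samples out of one long recording, another way to form pseudo-IDs is to share one ID across a fixed-length window of timesteps. This is a sketch of one possible preprocessing choice on your side, not something the repo does for you, and the window size of 64 is arbitrary:

```python
import numpy as np
import pandas as pd

steps, n_feat, win = 447, 8, 64   # 447 x 8 as in the original post; win is arbitrary
df = pd.DataFrame(np.random.randn(steps, n_feat))

# One pseudo-ID per non-overlapping window of `win` timesteps
df.index = np.arange(steps) // win

print(df.loc[0].values.shape)     # (64, 8) -- a 2-D (seq_length, feat_dim) sample
```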

rishanb commented 1 year ago

Thanks for the detailed response. This clarifies things, and is along the lines of what I was thinking the intended use of the IDs was.

My original approach was to set IDs == index (i.e. the "pseudo-IDs" idea you mentioned), but I hit the shape issue from the original post and couldn't find an obvious way to get that line to return a two-dimensional numpy array. As is, it returns an array of shape (num_of_features,).

My temporary workaround was to add this right after the line linked above:

X = self.feature_df.loc[self.IDs[ind]].values  # (seq_length, feat_dim) array

# ! Added potential code breaking line here
X = np.expand_dims(X, axis=0)

Which I believe does something like this:

X = [1, 2, 3, 4, 5, 6, 7, 8]
X = np.expand_dims(X, axis=0)
# X =  [[1, 2, 3, 4, 5, 6, 7, 8]]

And from the sound of the intended use of the IDs, it seems this should work with the rest of the code (from applying the masks through fine-tuning for classification), i.e. it is consistent with what the rest of the code expects? And if so, is there a more general way to alter the shape from within the data class itself?
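For what it's worth, a slightly more general guard than the unconditional expand_dims (my own sketch, not from the repo) is np.atleast_2d, which promotes a 1-D single-timestep sample to (1, feat_dim) but leaves an already 2-D multi-timestep sample untouched, so it should stay compatible with IDs that span many rows:

```python
import numpy as np

def as_seq(values):
    """Ensure a (seq_length, feat_dim) array.

    A 1-D (feat_dim,) array (single-timestep ID) becomes (1, feat_dim);
    an already-2-D array is returned unchanged.
    """
    return np.atleast_2d(values)

print(as_seq(np.zeros(8)).shape)       # (1, 8)
print(as_seq(np.zeros((5, 8))).shape)  # (5, 8) -- unchanged
```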