rishanb opened this issue 1 year ago
After looking through the code more, I suspect the shape of my source data might need to be altered. Right now `self.all_df` has shape (number of time steps, number of features + labels), and `self.feature_df` has shape (number of time steps == `self.all_IDs`, number of features). Should this instead be (`self.all_IDs` = 1, number of time steps, number of features)? That doesn't quite make sense to me when I look at `example_data_class.py`, but it seems to be what `ImputationDataset.__getitem__` is expecting?
Hi, the expected shape and format of the dataframe currently is (num_IDs * time_steps_per_ID, num_features_and_labels_per_step). That is, each row in the dataframe corresponds to a single timestep, but multiple rows are indexed by the same sample ID. This means that when you call `self.all_df.loc[my_id]` you will get all rows/timesteps for this ID (as a sub-dataframe or series), and thus when you call `self.feature_df.loc[my_id].values` you will get a (seq_length, feat_dim) array, exactly as happens in the `ImputationDataset` here. This is because the code in the general case needs to accommodate things like: sample 31 = a 2-second-wide measurement window of 7 sensors (features) with a sampling rate (in the signal-processing sense) of 100 samples/sec, corresponding to the activity "fishing" (label); sample 92 = a different (e.g. captured on a different day) 1-second measurement window corresponding to the activity "resting"; etc. These two samples (in the machine-learning sense, with IDs 31 and 92) would then occupy 200 + 100 = 300 rows, each with 7 features (+ 1 label). Does this clarify things?
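The layout described above can be sketched as follows (the random features here are placeholders, purely for illustration):

```python
import numpy as np
import pandas as pd

# Two machine-learning samples: ID 31 has 200 timesteps, ID 92 has 100.
# Each row is one timestep; the dataframe index repeats the sample ID.
rows_31 = pd.DataFrame(np.random.randn(200, 7), index=[31] * 200)
rows_92 = pd.DataFrame(np.random.randn(100, 7), index=[92] * 100)
all_df = pd.concat([rows_31, rows_92])

print(all_df.shape)         # (300, 7): 200 + 100 rows, 7 features each
X = all_df.loc[31].values   # all timesteps for sample 31
print(X.shape)              # (200, 7): a (seq_length, feat_dim) array
```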
I suspect that in your case you may always be considering 1 "sample" = 1 time step (where "sample" is used in the signal-processing sense, not the machine-learning sense). In this case, feel free to use "pseudo-IDs", e.g. one per row, by using the simple (ordinal) row index. It really depends on what you are trying to do (i.e. what your data is and what your labels are).
Thanks for the detailed response. This clarifies things, and is along the lines of what I was thinking the intended use of the IDs was. My original approach was to set IDs == index (i.e. the "pseudo-IDs" idea you mentioned), but I hit the shape issue from the original post and couldn't find an obvious way to get this line to return a 2-dimensional numpy array; as is, it returns an array of shape (num_of_features,).
My temporary solution was to do this right after the line linked above:

```python
X = self.feature_df.loc[self.IDs[ind]].values  # (seq_length, feat_dim) array
# ! Added potential code-breaking line here
X = np.expand_dims(X, axis=0)
```
Which I believe does something like this:

```python
X = [1, 2, 3, 4, 5, 6, 7, 8]
X = np.expand_dims(X, axis=0)
# X = [[1, 2, 3, 4, 5, 6, 7, 8]]
```
And from the description of the intention of the IDs, it sounds like this should work well with the rest of the code (from applying the masks to fine-tuning for classification), i.e. it is consistent with what the rest of the code expects? And if this is true, is there still a more general way to alter the shape from the data class itself?
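One way this could be handled without `np.expand_dims` is pandas label-list indexing: `df.loc[[some_id]]` returns a DataFrame (2-D `.values`) even when only one row matches, whereas `df.loc[some_id]` returns a 1-D Series. A minimal sketch (the example dataframe is made up; whether this fits the data class is an assumption, not something confirmed in this thread):

```python
import numpy as np
import pandas as pd

# One row per pseudo-ID, as in the "1 sample = 1 time step" case.
feature_df = pd.DataFrame(np.arange(24).reshape(3, 8), index=[0, 1, 2])

row_1d = feature_df.loc[1].values    # Series -> shape (8,)
row_2d = feature_df.loc[[1]].values  # single-row DataFrame -> shape (1, 8)

print(row_1d.shape)  # (8,)
print(row_2d.shape)  # (1, 8), same as np.expand_dims(row_1d, axis=0)
```

For IDs spanning multiple rows, `.loc[some_id]` already returns a DataFrame, so the double-bracket form mainly matters for the single-row pseudo-ID case.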
I'm attempting classification on custom data. There are 8 features and 447 time steps (or samples) in the train/val set. I'm guessing the issue is with my dataset, so I've included some shape printouts below.
The issue occurs in `dataset.py`, around line 263 (I've added a lot of comments, so the number might be a bit off), which reads:

```python
for m in range(X.shape[1]):  # feature dimension
```

and throws `IndexError: tuple index out of range`.
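For context, the `IndexError` follows directly from `X` being one-dimensional: its `shape` tuple then has a single entry, so `X.shape[1]` indexes past the end of the tuple. A minimal reproduction:

```python
import numpy as np

X = np.zeros(8)    # 1-D array: one row's worth of features
print(X.shape)     # (8,) -- a 1-tuple, so X.shape[1] does not exist

try:
    X.shape[1]
except IndexError as e:
    print(e)       # tuple index out of range

X2 = np.expand_dims(X, axis=0)
print(X2.shape)    # (1, 8): now X2.shape[1] == 8
```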
Printing out some variables to debug, just before the problematic line above, we can see X is a one-dimensional array with size equal to the number of features.
Further up, around line 35 or so, is where `noise_mask` is called. I've printed out some variables there to debug too.
Edit: The command I'm using to run is: