Closed aledcuevas closed 1 year ago
Hi,
Your understanding of TSBF looks right to me.
Indeed TSBF is hard to interpret because it consists of several steps:

1. The input dataset `X` consists of `n_samples` time series, each of length `n_timestamps`: `X.shape = (n_samples, n_timestamps)`.
2. `n_subsequences` subsequences are extracted from each time series, and each subsequence is also split into `n_intervals` subintervals. From each subinterval, 3 features are extracted (mean, standard deviation, slope). From each subsequence, 4 features are extracted (mean, standard deviation, start index, end index). As you said, we obtain a new dataset with shape `(n_samples * n_subsequences, 3 * n_intervals + 4)`. Each row is a subsequence and each column is a feature.
3. A first random forest classifier is fitted on this dataset, and its out-of-bag probabilities are computed, giving a dataset with shape `(n_samples * n_subsequences, n_classes)`. Each row is a subsequence and each column is the out-of-bag probability of belonging to a class.
4. For each time series, these probabilities are aggregated over its subsequences: for each class, a histogram with `n_bins` bins plus the mean probability, giving a final dataset with shape `(n_samples, (n_bins + 1) * n_classes)`.
5. A second random forest classifier is fitted on this final dataset.
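To make steps 1 and 2 concrete, here is a minimal numpy sketch of the subsequence/subinterval feature extraction (the sizes, subsequence placement, and slope computation are my own assumptions for illustration, not pyts internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_timestamps = 5, 40
n_subsequences, n_intervals = 3, 4
subseq_len = 20                       # hypothetical subsequence length
int_len = subseq_len // n_intervals   # subinterval length

X = rng.normal(size=(n_samples, n_timestamps))

rows = []
for ts in X:
    for s in range(n_subsequences):
        # hypothetical evenly spaced start positions
        start = s * (n_timestamps - subseq_len) // max(n_subsequences - 1, 1)
        sub = ts[start:start + subseq_len]
        feats = []
        for i in range(n_intervals):
            w = sub[i * int_len:(i + 1) * int_len]
            slope = np.polyfit(np.arange(int_len), w, 1)[0]  # least-squares slope
            feats += [w.mean(), w.std(), slope]              # 3 features per subinterval
        # 4 subsequence-level features: mean, std, start index, end index
        feats += [sub.mean(), sub.std(), start, start + subseq_len]
        rows.append(feats)

X_new = np.array(rows)
print(X_new.shape)  # (n_samples * n_subsequences, 3 * n_intervals + 4) = (15, 16)
```

Each row of `X_new` corresponds to one subsequence, which is what the first random forest is trained on.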
So, if we want to perform "reverse engineering", a feature in the final dataset is either one bin of the histogram of the out-of-bag probabilities for a given class, or the mean of those probabilities, computed over all the subsequences of a time series.
The big issue for interpretability is that, with the reduction functions used (mean and histogram), we lose a lot of "spatial" information: a histogram bin tells you how many subsequences received an out-of-bag probability in that range, but not which subsequences they were.
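To make this aggregation step concrete, here is a minimal numpy sketch (hypothetical sizes and histogram edges; not the pyts implementation) of reducing out-of-bag probabilities to histogram-plus-mean features:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_subsequences, n_classes, n_bins = 5, 3, 2, 4

# Hypothetical out-of-bag probabilities from the first random forest,
# one probability vector per subsequence
oob = rng.dirichlet(np.ones(n_classes), size=n_samples * n_subsequences)
oob = oob.reshape(n_samples, n_subsequences, n_classes)

features = []
for sample in oob:                    # sample: (n_subsequences, n_classes)
    per_class = []
    for c in range(n_classes):
        p = sample[:, c]
        # n_bins histogram counts + 1 mean per class
        hist, _ = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
        per_class += list(hist) + [p.mean()]
    features.append(per_class)

X_final = np.array(features)
print(X_final.shape)  # (n_samples, (n_bins + 1) * n_classes) = (5, 10)
```

Note how the histogram only counts subsequences per probability range: once the counts are computed, there is no way to tell which subsequence fell in which bin.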
And then, you would need to interpret the first random forest classifier...
Alternatively, you can use only the first random forest for the interpretation, which may be a decent approximation: its features are much easier to interpret (simple statistics extracted from a given interval). You then assume that a feature that is important for classifying subsequences is probably also a good feature for classifying whole time series, and you leave the "aggregation" step out of the interpretation.
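A sketch of this idea, using a plain scikit-learn `RandomForestClassifier` on a synthetic subsequence dataset (the column names below are an assumption about the feature order, not pyts's documented layout):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows, n_intervals = 60, 4  # hypothetical subsequence-level dataset
X_new = rng.normal(size=(n_rows, 3 * n_intervals + 4))
# each subsequence inherits the label of its time series
y_sub = rng.integers(0, 2, size=n_rows)

rf1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_new, y_sub)

# Label each column: 3 stats per subinterval, then 4 subsequence-level stats
names = [f"interval_{i}_{stat}" for i in range(n_intervals)
         for stat in ("mean", "std", "slope")]
names += ["subseq_mean", "subseq_std", "subseq_start", "subseq_end"]

# Rank features by importance; with real data this points at the
# subintervals and statistics the first forest relies on most
for name, imp in sorted(zip(names, rf1.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")
```

With random data the ranking is meaningless, but on your real subsequence features it shows which intervals and statistics carry the signal.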
Hope this helps you a bit, but I think that TSBF is way too complex for a simple interpretation.
Hello, I have a question about the interpretability of the TSBF model. Broadly, I want to understand what specific subsequences or intervals most contribute to the predictive power.
When accessing the TSBF estimator, we are able to inspect the features and feature importances of the second Random Forest Classifier (RFC). The number of these features is determined by
(n_bins + 1) * n_classes
, and the features seem to be ordered according to some mapping.
Each of these bins is derived from features computed on the subsequences and intervals of each time-series sample. I'm wondering whether there's a way to understand which features from each subsequence are most important. For instance, are there specific time periods within a subsequence that are most useful for prediction? As of now, I'm also lost on how to interpret each of the bins: what does it mean for a bin_i_k to have a high feature importance?
What I've tried/inferred so far
Given a TSBF, we can access the interval indices. In my case, I have interval indices of shape 6x4, where 6 is the nr_subsequences/subseries and 4 is the nr_intervals. These interval indices are then used to compute (start, end) pairs (a total of 18 pairs for an array of shape 6x4), which are in turn used to compute statistics for the subsequences (4 stats) and intervals (3 stats). These are returned as X_features, and the transformation should yield
X_new : array, shape = (n_samples * n_subseries, 3 * n_intervals + 4)
. Given that I'm working with 450 samples, I should get X_new.shape = (450 * 6, 3 * 4 + 4) = (2700, 16). This X_new is used to train a random forest (RF).
We can access the estimators within the TSBF, which form an ensemble of trees (i.e., a random forest). Each of these trees has (n_bins + 1) * n_classes features. What I'm trying to understand is: from the X_new that was extracted, which subsequences are useful? What does each of these bins map to?
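For reference, the shape arithmetic for my case can be checked directly (450 samples, 6 subsequences, 4 intervals):

```python
n_samples, n_subsequences, n_intervals = 450, 6, 4
shape = (n_samples * n_subsequences, 3 * n_intervals + 4)
print(shape)  # (2700, 16)
```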