Currently, the feature matrix contains columns with identical column vectors, e.g. for the two features TL:1 obvious and TT:1 obvious.
While this redundancy can affect model predictions positively for some model types and hyper-parameters, e.g. random forests using a sample of features in each split, in general we don't expect an advantage from features with identical column vectors (all values identical for the training data).
However, the respective column vectors in the feature matrix of the test set may be different. Simply picking a feature at random could introduce non-deterministic behaviour. It may be better to replace all columns that have the same column vector with a new feature that reports the average value in the group of features.
A way to implement this would be to add:
[ ] A function that identifies the groups of columns with identical values, returning a set of sets of column indices.
[ ] A function that calculates the average values for each group, deletes the column vectors and appends the new column vectors in a deterministic order, e.g. each set ordered by the smallest index in the set.
[ ] At training time, after building the preliminary feature matrix and before calling sklearn to train on it, identify the groups and replace the redundant columns with average columns using the two functions above.
[ ] Make the groups an attribute of the model so that this information is stored with the model.
[ ] At test time, apply the stored groups using the second function above.
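The two functions could look roughly like the sketch below. It is only an illustration under some assumptions: the feature matrix is a dense NumPy array (a sparse matrix would need different handling), and the names `find_identical_column_groups` and `merge_column_groups` are made up here, not existing code. Groups are returned as a list of tuples sorted by smallest index rather than a set of sets, so that the column order is deterministic.

```python
import numpy as np


def find_identical_column_groups(X):
    """Return groups of column indices whose column vectors are identical.

    Only groups with at least two columns are returned, as tuples of
    ascending indices, ordered by the smallest index in each group.
    Columns are compared byte-wise, so e.g. 0.0 and -0.0 count as different.
    """
    X = np.asarray(X)
    by_content = {}
    for j in range(X.shape[1]):
        by_content.setdefault(X[:, j].tobytes(), []).append(j)
    groups = [tuple(cols) for cols in by_content.values() if len(cols) > 1]
    groups.sort(key=lambda cols: cols[0])
    return groups


def merge_column_groups(X, groups):
    """Replace each group of columns with a single column of row-wise averages.

    Ungrouped columns keep their relative order; the averaged columns are
    appended afterwards, ordered by the smallest index in each group.
    """
    X = np.asarray(X, dtype=float)
    grouped = {j for cols in groups for j in cols}
    keep = [j for j in range(X.shape[1]) if j not in grouped]
    ordered = sorted(groups, key=min)
    averages = [X[:, list(cols)].mean(axis=1) for cols in ordered]
    return np.column_stack([X[:, keep]] + averages)


# Training time: columns 0 and 1 are identical, so they form one group.
X_train = np.array([[1, 1, 0],
                    [0, 0, 1],
                    [1, 1, 1]])
groups = find_identical_column_groups(X_train)
X_train_merged = merge_column_groups(X_train, groups)

# Test time: the same columns may differ, so the average carries information.
X_test = np.array([[1, 0, 1],
                   [0, 0, 0]])
X_test_merged = merge_column_groups(X_test, groups)
```

At training time the groups would be computed once and stored with the model (for example as an attribute on whatever object wraps the sklearn estimator), and at test time only `merge_column_groups` would be called with the stored groups.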