kanishk-adapt / semeval-task10

Repo for SemEval Task #10 EDOS 2023, created and maintained for DCU - ADAPT submissions.

redundant features #10

Open jowagner opened 1 year ago

jowagner commented 1 year ago

Currently, the feature matrix contains columns with identical column vectors, e.g. for the two features TL:1 obvious and TT:1 obvious.

While this redundancy can affect model predictions positively for some model types and hyper-parameters, e.g. random forests that use a sample of features in each split, in general we don't expect an advantage from features with identical column vectors (all values identical in the training data).

However, the respective column vectors in the feature matrix of the test set may differ. Simply picking one feature from each group at random could introduce non-deterministic behaviour. It may be better to replace all columns that share the same column vector with a single new feature that reports the average value across the group of features.
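To illustrate the idea, here is a minimal sketch that assumes the feature matrices are dense NumPy arrays with the same column order for training and test data. The function names `group_identical_columns` and `merge_columns` are hypothetical and not taken from the repository. Groups are determined on the training matrix only and then applied unchanged to any other matrix, so the result is deterministic.

```python
import numpy as np


def group_identical_columns(X_train):
    """Group column indices whose column vectors are identical in X_train.

    Hypothetical helper: groups are returned in order of first occurrence,
    so the output does not depend on any random choice.
    """
    groups = {}
    for j in range(X_train.shape[1]):
        # Use the raw bytes of the column as a hashable key.
        key = X_train[:, j].tobytes()
        groups.setdefault(key, []).append(j)
    return list(groups.values())


def merge_columns(X, groups):
    """Replace each group of columns with one column holding the group's row-wise mean."""
    return np.column_stack([X[:, g].mean(axis=1) for g in groups])


# Toy data: columns 0 and 2 are identical in training but differ in test.
X_train = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
X_test = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]])

groups = group_identical_columns(X_train)
X_train_merged = merge_columns(X_train, groups)
X_test_merged = merge_columns(X_test, groups)
```

On the training data the merged column equals each of the original columns in its group, so nothing is lost there. On the test data it averages the columns that disagree instead of arbitrarily keeping one of them.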

A way to implement this would be to add: