moodlehq / moodle-mlbackend-python

Moodle machine learning backend
GNU General Public License v3.0

Classifier: remove invariant columns #23

Closed douglasbagnall closed 4 years ago

douglasbagnall commented 4 years ago

It is common for the mlbackend to receive training data with columns that do not vary across the rows. Usually they are all zero, in which case they have no effect on training, but they could be (say) all one, in which case the column becomes a duplicate bias vector.

In training this is a waste of resources, but the real trouble comes with prediction. If a row at prediction time has a different value in that column, its effect is essentially random, because the weights for that column were never trained against any variation in it.

How can this happen? Well, suppose the column is called 'course_X' and the training data is from last year. Course X was not offered last year, for whatever reason, but this year it is.

The solution is to ignore all columns with no variation, and remember which columns they were. This makes training faster and prediction better.
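As a rough illustration of the idea (not the code in this PR), the filtering could look like the NumPy sketch below. The helper names `find_variable_columns` and `keep_columns` and the toy data are hypothetical, and a column is assumed to be invariant when its minimum equals its maximum over the training rows.

```python
import numpy as np


def find_variable_columns(X):
    """Return the indexes of columns whose values differ across rows.

    Hypothetical helper: a column counts as invariant when its minimum
    equals its maximum over all training rows.
    """
    X = np.asarray(X)
    return np.flatnonzero(X.max(axis=0) != X.min(axis=0))


def keep_columns(X, variable_columns):
    """Drop every column that is not listed in variable_columns."""
    return np.asarray(X)[:, variable_columns]


# Toy training data: column 0 is all zero, column 1 is all one
# (a duplicate bias), columns 2 and 3 actually vary.
train = np.array([
    [0.0, 1.0, 0.2, 0.0],
    [0.0, 1.0, 0.7, 1.0],
    [0.0, 1.0, 0.5, 0.0],
])

variable_columns = find_variable_columns(train)
train_filtered = keep_columns(train, variable_columns)

# At prediction time column 0 is suddenly non-zero (the 'course_X' case).
# Reusing the remembered indexes means the untrained column is ignored.
new_row = np.array([[1.0, 1.0, 0.3, 1.0]])
new_filtered = keep_columns(new_row, variable_columns)
```

The important point is that the indexes are computed once, from the training data, and then reused unchanged for every later prediction.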

To save the variable columns we need to transfer the indexes to and from the TF object, because that is what we save. Other than the minor hackiness involved there, it is all quite simple.
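The round trip can be pictured with the sketch below. It is an assumption-laden stand-in: rather than going through the TF object as this PR does, it stores the indexes next to some placeholder weights in a NumPy `.npz` archive, and `save_state` / `load_state` are hypothetical names.

```python
import io

import numpy as np


def save_state(fileobj, weights, variable_columns):
    """Write weights and the remembered column indexes to one archive.

    Hypothetical stand-in for saving through the TF object.
    """
    np.savez(
        fileobj,
        weights=weights,
        variable_columns=np.asarray(variable_columns, dtype=np.int64),
    )


def load_state(fileobj):
    """Read back the weights and the column indexes saved by save_state."""
    with np.load(fileobj) as state:
        return state["weights"], state["variable_columns"]


# Round trip through an in-memory buffer; a real file path works the same way.
buf = io.BytesIO()
save_state(buf, np.zeros((2, 1), dtype=np.float32), [2, 3])
buf.seek(0)
weights, restored_columns = load_state(buf)
```

Keeping the indexes in the same saved artifact as the weights means a model can never be loaded without the column filter it was trained with.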

douglasbagnall commented 4 years ago

Incorporated into #27, where a 5x speed-up is measured, partly due to this change and partly due to the switch to 32-bit.