It is common for the mlbackend to receive training data with columns
that do not vary across the rows. Usually they are all zero, in which
case they have no effect on training, but they could be (say) all one,
in which case the column becomes a duplicate bias vector.
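For instance, a quick way to spot such columns (a NumPy sketch with made-up data; the mlbackend's real input handling may differ):

    import numpy as np

    # Hypothetical training matrix: the last column never varies.
    X = np.array([[0.2, 1.0, 1.0],
                  [0.7, 0.0, 1.0],
                  [0.5, 1.0, 1.0]])

    # A column varies if its peak-to-peak range is non-zero.
    varies = np.ptp(X, axis=0) != 0
    print(varies)  # [ True  True False]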
In training this is a waste of resources, but the real trouble comes
with prediction. If a row has a different value in that column, its
effect is essentially random, because the weights for that column were
never trained.
How can this happen? Well, suppose the column is called 'course_X' and
the training data is from last year. Course X was not offered last
year for whatever reason, but this year it is.
The solution is to ignore all columns with no variation, and remember
which columns they were. This makes training faster and prediction
more reliable.
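A minimal sketch of that idea (the helper name and data are mine, not
the actual mlbackend code):

    import numpy as np

    def variable_column_indexes(X):
        """Indexes of the columns that vary across rows."""
        return np.flatnonzero(np.ptp(X, axis=0))

    X_train = np.array([[0.2, 1.0, 1.0],
                        [0.7, 0.0, 1.0],
                        [0.5, 1.0, 1.0]])
    keep = variable_column_indexes(X_train)   # array([0, 1])
    model_input = X_train[:, keep]            # constant column dropped

    # At prediction time, apply the same selection, so a stray value
    # in the untrained column is simply ignored.
    X_new = np.array([[0.9, 0.0, 0.0]])
    prediction_input = X_new[:, keep]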
To persist the variable columns we need to transfer their indexes to
and from the TensorFlow model object, because that object is what we
actually save. Other than the minor hackiness involved there, it is
all quite simple.
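That hack might look something like the following sketch, assuming
TensorFlow 2's SavedModel tracking; the keep_columns attribute name and
the toy model are mine, not the real mlbackend code. Anything trackable
hung off the model (such as a non-trainable variable) is serialised
with it and comes back on load:

    import tensorflow as tf

    keep = [0, 1]  # indexes of the variable columns, from training

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(len(keep),)),
        tf.keras.layers.Dense(4, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

    # Attach the indexes as a non-trainable variable so they travel
    # with the saved model object.
    model.keep_columns = tf.Variable(keep, trainable=False,
                                     dtype=tf.int64)

    tf.saved_model.save(model, '/tmp/model')
    restored = tf.saved_model.load('/tmp/model')
    print(restored.keep_columns.numpy())  # -> [0 1]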