shogun-toolbox / shogun

Shōgun
http://shogun-toolbox.org
BSD 3-Clause "New" or "Revised" License

Label assertion and mapping in Machine #5054

Open gf712 opened 4 years ago

gf712 commented 4 years ago

Currently, some classification algorithms check for themselves whether the input Labels are valid, e.g. that the class labels are contiguous integers [0, 1, ..., n_classes-1], which leads to a lot of duplicate code. These checks should instead be done by the Machine base class when training is performed. The Machine would then store the mapping from the user's input Labels to an internal encoding: e.g. a binary classification task would map {10,20} -> {-1,+1} using a BinaryLabelEncoder class, and similarly there would be a MulticlassLabelsEncoder class for multiclass tasks. The properly encoded Labels are then dispatched to the train_machine method. When apply is called, the returned Labels are mapped back to the user's input label space using the LabelEncoder.
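As a rough illustration of the flow described above, here is a minimal, language-agnostic sketch in Python. It is not Shogun's API: BinaryLabelEncoder is named in the proposal, but its methods (fit, transform, inverse_transform), the apply_machine hook, and the ThresholdClassifier toy are assumptions made up for this example.

```python
class BinaryLabelEncoder:
    """Hypothetical sketch: maps two arbitrary label values to {-1, +1}."""

    def fit(self, labels):
        classes = sorted(set(labels))
        if len(classes) != 2:
            raise ValueError(f"expected exactly 2 classes, got {len(classes)}")
        # The smaller user label becomes -1, the larger one +1.
        self.to_internal = {classes[0]: -1, classes[1]: +1}
        self.to_user = {-1: classes[0], +1: classes[1]}
        return self

    def transform(self, labels):
        try:
            return [self.to_internal[y] for y in labels]
        except KeyError as err:
            raise ValueError(f"label {err.args[0]!r} was not seen during fit") from None

    def inverse_transform(self, encoded):
        return [self.to_user[y] for y in encoded]


class Machine:
    """Hypothetical base class: validates and encodes labels once, for all subclasses."""

    def train(self, features, labels):
        self.encoder = BinaryLabelEncoder().fit(labels)  # validation happens here
        self.train_machine(features, self.encoder.transform(labels))
        return self

    def apply(self, features):
        # Map internal predictions back to the user's label space.
        return self.encoder.inverse_transform(self.apply_machine(features))


class ThresholdClassifier(Machine):
    """Toy 1-D classifier; assumes the +1 class has the larger feature values."""

    def train_machine(self, features, encoded_labels):
        # Only ever sees labels in {-1, +1}, so no checks are needed here.
        neg = [x for x, y in zip(features, encoded_labels) if y == -1]
        pos = [x for x, y in zip(features, encoded_labels) if y == +1]
        self.threshold = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

    def apply_machine(self, features):
        return [+1 if x > self.threshold else -1 for x in features]


clf = ThresholdClassifier().train([0.0, 1.0, 4.0, 5.0], [10, 10, 20, 20])
predictions = clf.apply([0.5, 4.5])  # [10, 20], in the user's label space
```

The point of the design is that the subclass contains no label checks or conversions at all: the base class owns both the validation and the stored mapping.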

The tasks (in order):

Most of this code already exists, but it is spread around the code base

karlnapf commented 4 years ago

A lot of the conversion code is already inside the labels classes, so it can be re-used. E.g. here and here

Also note that some of this code is already used by the old approach, in which the algorithm classes convert the labels to the appropriate form themselves (rather than the base class doing it, as outlined above). See e.g. here. That code would simply be removed under the approach described above, since the algorithms are then guaranteed to receive appropriately encoded labels. Finally, the old approach currently in use might cause bugs or wrong results when used within cross-validation, as the mappings might change across folds.
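To make the cross-validation concern concrete, here is a small Python sketch of one way a mapping could differ between folds. It is an illustration under assumptions, not a reproduction of a confirmed Shogun bug: fit_mapping and decode are hypothetical helpers, and the fold contents are made up.

```python
def fit_mapping(labels):
    """Map the sorted distinct labels to 0..n_classes-1 (hypothetical helper)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def decode(encoded, mapping):
    """Map internal class indices back to user labels using the given mapping."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[index] for index in encoded]


fold_a = [10, 20, 30]
fold_b = [10, 30, 10]  # class 20 happens to be absent from this fold

mapping_a = fit_mapping(fold_a)  # {10: 0, 20: 1, 30: 2}
mapping_b = fit_mapping(fold_b)  # {10: 0, 30: 1}

# Suppose a model trained on fold_b predicts these internal classes.
internal_predictions = [1, 0]

# Decoding with the mapping stored at training time gives the intended labels.
consistent = decode(internal_predictions, mapping_b)  # [30, 10]

# Decoding with a mapping recomputed from different data silently changes them.
inconsistent = decode(internal_predictions, mapping_a)  # [20, 10]
```

Internal class 1 means 30 under one mapping and 20 under the other, which is why storing the mapping alongside the trained Machine, as proposed above, keeps encoding and decoding consistent.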