Is there a way to keep textual labels / targets as a part of the trained model?

artemiusgreat commented 4 years ago

First of all, thank you for sharing this library. Second, would be great to make mapping between columns and feature names mode obvious. If model was serialized and saved on one computer and deserialized and loaded on the other one, then second computer will have no idea what's the meaning of labels / targets, because model keeps them as double values.

Save model


var labels = new[] { "Good", "Bad", "Average" ...  };
var labelKeys = labels.Select((v, i) => (double) i);  // take label key instead of name 
var learner = new ClassificationDecisionTreeLearner();
var model = learner.Learn(items, labelKeys); // is there any reason not to use string labels instead of doubles?

using (var memoryStream = new MemoryStream())
{
  var serializer = new GenericXmlDataContractSerializer();
  serializer.Serialize<IPredictorModel<double>>(model, () => new StreamWriter(memoryStream));
  db.Save(memoryStream.ToArray());  // convert XML to byte[] and save as Blob to DB
}

Load model

var xmlModel = db.Get(...).AsBlob().GetBytes(); // load saved model from blob column in DB

using (var memoryStream = new MemoryStream(xmlModel))
{
  var serializer = new GenericXmlDataContractSerializer();
  var xml = serializer.Deserialize<IPredictorModel<double>>(() => new StreamReader(memoryStream));
}

As a result, loaded model has property Targets that contains some double values, like 1, 2, 3, 4 and there is no way to understand that initially they meant "Good", "Bad", etc

Question

Is there a way to save original string labels / targets as a part of the model to make prediction results human-readable?

mdabros commented 4 years ago

Hi @artemiusgreat.

Thanks for using SharpLearning, and thanks for opening the issue. Currently all learners in SharpLearning implements the IPredictor and IPredictorLearner interfaces. That is, both Regression models and Classification models share the interface. Using string labels would not be an option for a regression model since the targets are floating point numbers. So inorder to share the interface, classification learners use doubles for targets as well.

Practically, if you need to know the names of the classes the model predicts, you can store the mapping next to the model:

var labels = new[] { "Good", "Bad", "Average" ...  };
var labelKeyToLabelName = Enumerable.Range(0, labels.Length)
   .ToDictionary(i => (double)i, i => labels[i]);

using (var memoryStream = new MemoryStream())
{
  var serializer = new GenericXmlDataContractSerializer();
  serialier.Serialize<Dictionary<double, string>>(labelKeyToLabelName , () => new StreamWriter(memoryStream));
  db.Save(memoryStream.ToArray());  // convert XML to byte[] and save as Blob to DB
}

If you want to store them together you can also write you own type using string as the prediction type, and then serialize and use that type instead:

[Serializable]
public class ClassificationPredictorModel : IPredictorModel<string>
{
    readonly IPredictorModel<double> m_model;
    readonly Dictionary<double, string> m_labelValueTolabelName;

    public ClassificationModel(IPredictorModel<double> model, 
        Dictionary<double, string> labelValueTolabelName)  
    {
        m_model = model ?? throw new ArgumentNullException(nameof(model));
        m_labelValueTolabelName = labelValueTolabelName ?? throw new ArgumentNullException(nameof(labelValueTolabelName));
    }

    public double[] GetRawVariableImportance()
    {
        return m_model.GetRawVariableImportance();
    }

    public Dictionary<string, double> GetVariableImportance(
        Dictionary<string, int> featureNameToIndex)
    {
        return m_model.GetVariableImportance(featureNameToIndex);
    }

    public string Predict(double[] observation)
    {
        var labelValue = m_model.Predict(observation);
        return m_labelValueTolabelName[labelValue];
    }

    public string[] Predict(F64Matrix observations)
    {
        var predictions = new string[observations.RowCount];
        var observation = new double[observations.ColumnCount];
        for (int i = 0; i < observations.RowCount; i++)
        {
            observations.Column(i, observation);
            predictions[i] = Predict(observation);
        }
        return predictions;
    }
}

Best regards Mads

mdabros commented 4 years ago

Alternatively. if you don't need the variable importance methods, you can use the IPredictor interface for a simpler implementation:

[Serializable]
public class ClassificationPredictor : IPredictor<string>
{
    readonly IPredictor<double> m_model;
    readonly Dictionary<double, string> m_labelValueTolabelName;

    public ClassificationPredictor(IPredictor<double> model,
        Dictionary<double, string> labelValueTolabelName)
    {
        m_model = model ?? throw new ArgumentNullException(nameof(model));
        m_labelValueTolabelName = labelValueTolabelName ?? throw new ArgumentNullException(nameof(labelValueTolabelName));
    }

    public string Predict(double[] observation)
    {
        var labelValue = m_model.Predict(observation);
        return m_labelValueTolabelName[labelValue];
    }

    public string[] Predict(F64Matrix observations)
    {
        var predictions = new string[observations.RowCount];
        var observation = new double[observations.ColumnCount];
        for (int i = 0; i < observations.RowCount; i++)
        {
            observations.Column(i, observation);
            predictions[i] = Predict(observation);
        }
        return predictions;
    }
}

artemiusgreat commented 4 years ago

It works, thanks.

Final version

public class MapModel<TKey, TValue> : IPredictorModel<KeyValuePair<TKey, TValue>>
{
  public IDictionary<TKey, TValue> Map { get; set; }
  public IPredictorModel<double> Model { get; set; }

  public double[] GetRawVariableImportance()
  {
    return Model.GetRawVariableImportance();
  }

  public Dictionary<string, double> GetVariableImportance(Dictionary<string, int> featureNameToIndex)
  {
    return Model.GetVariableImportance(featureNameToIndex);
  }

  public KeyValuePair<TKey, TValue> Predict(double[] observation)
  {
    var predictionKey = ConversionManager.Value<TKey>(Model.Predict(observation));

    return Map.TryGetValue(predictionKey, out TValue prediction) ? new KeyValuePair<TKey, TValue>(predictionKey, prediction) : default;
  }

  public KeyValuePair<TKey, TValue>[] Predict(F64Matrix observations)
  {
    var predictions = new KeyValuePair<TKey, TValue>[observations.RowCount];
    var observation = new double[observations.ColumnCount];

    for (var i = 0; i < observations.RowCount; i++)
    {
      observations.Column(i, observation);
      predictions[i] = Predict(observation);
    }

    return predictions;
  }
}

Example

var container = new MapModel<int, string>
{
  Map = new Dictionary<int, string>{ [1] = "Good", [2] = "Bad" },
  Model = new ClassificationDecisionTreeLearner().Learn(Observations, Targets)
};

serializer.Serialize(container, () => new StreamWriter(memoryStream));
var model = serializer.Deserialize<MapModel<int, string>>(() => new StreamReader(memoryStream));
var predictions = model.Predict(processor.Input.Observations);

artemiusgreat commented 4 years ago

Side notes

Decided to make both types generic - TKey and TValue to make sure that TKey is an integer type and will not cause normalization issues like in the code below.

Dictionary<double, dynamic> map = new Dictionary<double, dynamic>();

double sourceKey = 0.2; // 0.19999999999999574
double complexKey = 0.1;

complexKey += 0.1; // 0.2
map[sourceKey] = 1;

var v1 = map.ContainsKey(sourceKey) && map.ContainsKey(complexKey);
var v2 = map.TryGetValue(sourceKey, out int a) && map.TryGetValue(complexKey, out int b);
var v3 = map[sourceKey] & map[complexKey];  // Exception, because sourceKey is not equal to complexKey

Debugger results

sourceKey => 0.19999999999999574
complexKey => 0.2   
v1 => false 
v2 => false
a => 1  
b => 0

mdabros commented 4 years ago

@artemiusgreat Thanks for adding the additional notes and conclusion. Glad it worked!

mdabros / SharpLearning

Is there a way to keep textual labels / targets as a part of the trained model? #132