metarank / metarank

A low code Machine Learning personalized ranking service for articles, listings, search results, recommendations that boosts user engagement. A friendly Learn-to-Rank engine
https://metarank.ai
Apache License 2.0

Add feature names to the dataset export command #1312

Closed Currie32 closed 5 months ago

Currie32 commented 5 months ago

I'm hoping that you can add a mapping to the feature names when I use the dataset export command.

When I created a new model, I could map the feature indices to the feature names since the order was the same as in the config file. For example, the first row of train.svm looks like:

0 qid:123456 1:1.0 2:41.0 3:3.0 4:61.0 5:1.0

However, when I retrained this model and changed the features in the model and the training data, the train.svm file looked more like:

0 qid:123456 1:1.0 2:41.0 6:22.0 7:1.0 9:21.0

Given that the index of a feature no longer corresponds to the feature's position in the config file, I'm finding it difficult to identify each feature.

Ideally, train.svm would look like:

0 qid:123456 feature_a:1.0 feature_b:41.0 feature_c:22.0 feature_d:1.0 feature_e:21.0

But I'd also be happy with something like a json file that has the mapping:

{
  "1": "feature_a",
  "2": "feature_b",
  "6": "feature_c",
  "7": "feature_d",
  "9": "feature_e"
}

shuttie commented 5 months ago

Metarank's export format depends on the model used (so it differs between XGBoost and LightGBM), and in practice it tries to match each model's quirks. The SVM format is used for XGBoost, and the issue is that the SVM file format assumes that the feature index is a numeric index, not a name: https://stats.stackexchange.com/questions/61328/libsvm-data-format
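
To illustrate the index-only layout, here is a rough Python sketch that parses one row like the ones quoted above and attaches names from a hand-maintained mapping. The helper `parse_svm_line` and the `index_to_name` dict are hypothetical and not part of Metarank; the mapping is simply the one from the question and would have to be supplied by the user.

```python
def parse_svm_line(line):
    """Split one LibSVM-style ranking row into (label, qid, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    qid = None
    features = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        if key == "qid":
            qid = value
        else:
            # Feature keys are numeric indices, so they parse as int.
            features[int(key)] = float(value)
    return label, qid, features


# Assumption: this mapping is maintained by hand; the export does not provide it.
index_to_name = {1: "feature_a", 2: "feature_b", 6: "feature_c",
                 7: "feature_d", 9: "feature_e"}

label, qid, features = parse_svm_line(
    "0 qid:123456 1:1.0 2:41.0 6:22.0 7:1.0 9:21.0"
)
named = {index_to_name[i]: v for i, v in features.items()}
print(named)
```

Indices missing from a row (3, 4, 5 and 8 here) are simply absent, which is why position in the row alone does not identify a feature.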

If we change the format, XGBoost won't be able to load the exported data without an extra transformation.
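
As a sketch of what that extra transformation might look like, the snippet below turns a row in the name-based layout proposed in the question back into numeric indices. The function `to_indexed` and the `name_to_index` dict are hypothetical names for illustration only, and the mapping is assumed to be supplied by the user.

```python
def to_indexed(line, name_to_index):
    """Rewrite 'name:value' pairs as 'index:value', leaving label and qid as-is."""
    tokens = line.split()
    out = [tokens[0]]  # the relevance label comes first
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        if key == "qid":
            out.append(token)
        else:
            out.append(f"{name_to_index[key]}:{value}")
    return " ".join(out)


# Assumption: the same hand-maintained mapping as in the question, inverted.
name_to_index = {"feature_a": 1, "feature_b": 2, "feature_c": 6,
                 "feature_d": 7, "feature_e": 9}

indexed = to_indexed(
    "0 qid:123456 feature_a:1.0 feature_b:41.0 feature_c:22.0 "
    "feature_d:1.0 feature_e:21.0",
    name_to_index,
)
print(indexed)
```

Every consumer of the file would need a step like this before training, which is the cost of moving away from the plain index-based format.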

But if you switch from XGBoost to LightGBM, the export format switches to CSV, which includes column names.