Clarify how we handle missing values, NaN, zeros...

nomoa commented 6 years ago

While trying to add support for XGBoost's missing direction, I realized that the way we handle missing values is not very clear, in either the code or the docs.

During logging we allow users to set missing_as_zero, which emits zeros instead of nothing. After that, it's up to the user to configure their training algorithm to handle these properly. E.g. XGBoost supports missing values and will emit a model with an additional "is missing?" decision besides the threshold check. Today the model parser for XGBoost completely ignores the missing branch, which basically assumes the features were logged with missing_as_zero. As for ranklib, its DataPoint natively supports missing values by using NaN, but in the plugin glue code we force all values to zero in the reset() method.
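To make the difference concrete, here is a minimal sketch of a tree split that honours a missing branch versus one that ignores it. This is not the plugin's actual parser or tree code: the class MissingBranchSketch, the Node type and both eval methods are hypothetical names, and the sketch assumes NaN marks a missing value and that a value below the threshold goes to the "yes" child.

```java
// Hypothetical sketch; not the plugin's actual decision tree code.
final class MissingBranchSketch {
    /** One tree node. Field names loosely follow the yes/no/missing branches of an XGBoost dump. */
    static final class Node {
        final int feature;      // index of the feature tested by a split node
        final float threshold;  // split condition: value < threshold goes to "yes"
        final Node yes;
        final Node no;
        final Node missing;     // default direction, always one of yes/no
        final float leafValue;  // only meaningful for leaves

        private Node(int feature, float threshold, Node yes, Node no, Node missing, float leafValue) {
            this.feature = feature;
            this.threshold = threshold;
            this.yes = yes;
            this.no = no;
            this.missing = missing;
            this.leafValue = leafValue;
        }

        static Node leaf(float value) {
            return new Node(-1, Float.NaN, null, null, null, value);
        }

        static Node split(int feature, float threshold, Node yes, Node no, boolean missingGoesYes) {
            return new Node(feature, threshold, yes, no, missingGoesYes ? yes : no, 0f);
        }

        boolean isLeaf() {
            return yes == null;
        }
    }

    /** Follows the missing branch when the feature value is NaN (assumed missing marker). */
    static float eval(Node node, float[] features) {
        while (!node.isLeaf()) {
            float v = features[node.feature];
            if (Float.isNaN(v)) {
                node = node.missing;
            } else {
                node = v < node.threshold ? node.yes : node.no;
            }
        }
        return node.leafValue;
    }

    /** Ignores the missing branch: a missing value is scored as if it were 0. */
    static float evalIgnoringMissing(Node node, float[] features) {
        while (!node.isLeaf()) {
            float v = features[node.feature];
            if (Float.isNaN(v)) {
                v = 0f; // same effect as logging with missing_as_zero
            }
            node = v < node.threshold ? node.yes : node.no;
        }
        return node.leafValue;
    }

    public static void main(String[] args) {
        // Split on feature 0 at 0.5; the model learned to send missing values to "no".
        Node tree = Node.split(0, 0.5f, Node.leaf(1.0f), Node.leaf(2.0f), false);
        float[] features = {Float.NaN};
        System.out.println(eval(tree, features));                // 2.0
        System.out.println(evalIgnoringMissing(tree, features)); // 1.0
    }
}
```

In this toy tree the two evaluators disagree as soon as the learned default direction differs from the side that 0 falls on, which is the case the current parser silently gets wrong unless training also saw zeros.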

Things we should fix regardless:

~~when logging we should fail if we emit NaN for a non-missing value (bogus feature/query)~~ #136 .

Evaluate:

- start by adding a boolean missing(int featureIdx) method to the FeatureVector interface.
- eventually fix our decision tree implementation (or add a new one) so that it supports missing values. Sadly, the XGBoost format has no way to tell us whether the missing branch needs to be checked at all (which would allow an optimization).
- doc: clarify how we handle missing values in the various ranker implementations we support, so that users can make an informed decision about logging features with missing_as_zero.
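
As a rough illustration of the first item, a feature vector could track which features were actually set and expose that through missing(int featureIdx). This is only a sketch under assumptions: the interface below is a simplified, hypothetical stand-in rather than the plugin's real FeatureVector, and the BitSet-backed ArrayFeatureVector is one possible implementation among others.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch; the plugin's real FeatureVector interface may differ.
final class MissingAwareVectorSketch {
    interface FeatureVector {
        void setFeatureScore(int featureIdx, float score);

        float getFeatureScore(int featureIdx);

        /** Proposed addition: true if no score was set for this feature since the last reset. */
        boolean missing(int featureIdx);

        void reset();
    }

    /** Dense array of scores plus a bitset recording which features were set. */
    static final class ArrayFeatureVector implements FeatureVector {
        private final float[] scores;
        private final BitSet present;

        ArrayFeatureVector(int size) {
            this.scores = new float[size];
            this.present = new BitSet(size);
        }

        @Override
        public void setFeatureScore(int featureIdx, float score) {
            scores[featureIdx] = score;
            present.set(featureIdx);
        }

        @Override
        public float getFeatureScore(int featureIdx) {
            // Unset features still read as 0, so rankers that ignore missing() are unaffected.
            return scores[featureIdx];
        }

        @Override
        public boolean missing(int featureIdx) {
            return !present.get(featureIdx);
        }

        @Override
        public void reset() {
            Arrays.fill(scores, 0f);
            present.clear();
        }
    }

    public static void main(String[] args) {
        FeatureVector vector = new ArrayFeatureVector(2);
        vector.setFeatureScore(0, 0f); // a real zero, not a missing value
        System.out.println(vector.missing(0)); // false
        System.out.println(vector.missing(1)); // true
    }
}
```

Keeping getFeatureScore returning 0 for unset features means existing rankers would keep their current behaviour, while a missing-aware tree could consult missing() before the threshold check.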

nomoa commented 6 years ago

edit: removed the comment on ranklib DataPoint. The default constructor does nothing, so the float array is initialized with zeros, which is consistent with the reset() method.

ebernhardson commented 6 years ago

I realized while looking over some feature values that the query explorer query also works slightly differently from others with respect to missing_as_zero. Query explorer appears to always match the document, but emits a score of 0.0 if the provided query doesn't match the document. This is perhaps slightly complicated by classic_idf not depending on the query terms (it should match everything), while others like raw_ttf do depend on the terms.

The result of this is basically that missing_as_zero doesn't combine with query explorer in the same way as a match query, even if the query explorer is exploring that same match query.

subsetpark commented 2 years ago

Hi there,

This is an older ticket but seems to be the canonical discussion point for dealing with missing values.

We are building our ESLTR pipeline, and currently we log some values which can be missing. For instance, if the data behind a logged feature is behind a feature flag, and the account/session being logged is outside of that flag, that feature will be missing in the logs: it will have an entry in the _ltr output but no value.

The model we're training is built with XGBoost, so we are currently representing that feature as NaN for the observation in question.

I have two questions about current best practices for a scenario like ours:

1. Generally, is this ticket still a priority? Does ESLTR still intend to handle missing values as distinct from 0s?
2. Perhaps more interestingly: given the current state of affairs, what is the best way to represent missing values for ESLTR? Intuitively, it seems problematic to simply treat them as 0s, because a 0 value for a binary feature means something different from "this feature is irrelevant for this observation".

Other tickets here have alluded to using some other sentinel value, for instance the maximum float value, or perhaps -1. But I'm curious: does the ESLTR team have any current recommendations for how to express missing features as distinct from negative features? Or, alternatively, is the distinction not important? Should they be treated the same as negative features?

o19s / elasticsearch-learning-to-rank

Clarify how we handle missing values, NaN, zeros... #135