Open nomoa opened 6 years ago
edit: removed the comment on ranklib DataPoint. The default constructor does nothing, so the float array is properly initialized with zeros, which is consistent with the reset() method.
I realized while looking over some feature values that the query explorer query also works slightly differently from others with respect to missing as zero. Query explorer appears to always match the document regardless, but emits a score of 0.0 if the provided query doesn't match the document. This is perhaps slightly complicated by classic_idf not depending on the query terms (it should match everything) while others like raw_ttf do depend on the terms.
The result of this is basically that missing_as_zero doesn't combine with query explorer in the same way as a match query, even if the query explorer is exploring that same match query.
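To make the difference concrete, here is a toy model of the behaviour described above. It is only a sketch and not the plugin's code: both function names are hypothetical, and `None` stands in for "no value logged".

```python
def log_match_feature(matched, score, missing_as_zero):
    """Toy model of logging a plain match query feature (hypothetical helper)."""
    if matched:
        return score
    # A non-matching document produces no value unless missing_as_zero is set.
    return 0.0 if missing_as_zero else None


def log_explorer_feature(wrapped_query_matched, stat):
    """Toy model of the query explorer behaviour described in this comment.

    The explorer always matches, so it always logs a value; it logs 0.0
    when the wrapped query does not match the document.
    """
    return stat if wrapped_query_matched else 0.0


# Match query: the flag decides whether a miss is logged as "nothing" or as 0.0.
print(log_match_feature(False, 0.0, missing_as_zero=False))  # None
print(log_match_feature(False, 0.0, missing_as_zero=True))   # 0.0

# Query explorer: a miss is 0.0 regardless of the flag, so "missing"
# cannot be told apart from a real zero in this model.
print(log_explorer_feature(False, 3.0))  # 0.0
```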
Hi there,
This is an older ticket but seems to be the canonical discussion point for dealing with missing values.
We are currently building our ESLTR pipeline, and we log some values which can be missing. For instance, if the data behind a logged feature is behind a feature flag, and the account/session being logged is outside of that flag, that feature will be missing in the logs: it will have an entry in the _ltr output but no value.
The model we're training is built with XGBoost, so we are currently representing that feature as NaN for the observation in question.
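A minimal sketch of that mapping, assuming each logged entry is a dict with a "name" key and an optional "value" key (the entry shape, the feature names, and the helper `features_from_log` are illustrative assumptions, not the plugin's documented contract):

```python
import math


def features_from_log(entries, feature_names):
    """Build one training row, using NaN where a feature was logged without a value."""
    by_name = {entry["name"]: entry for entry in entries}
    row = []
    for name in feature_names:
        entry = by_name.get(name)
        if entry is None or "value" not in entry:
            row.append(math.nan)  # missing: distinct from a real 0.0
        else:
            row.append(float(entry["value"]))
    return row


# Hypothetical logged entries: the second feature has an entry but no value.
logged = [
    {"name": "title_match", "value": 9.5},
    {"name": "flagged_feature"},
    {"name": "is_premium", "value": 0.0},
]
row = features_from_log(logged, ["title_match", "flagged_feature", "is_premium"])
print(row)
```

Rows built this way keep a genuine 0.0 (here `is_premium`) separate from a feature that was never computed.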
I have two questions about current best practices for a scenario like ours:
First, and more generally: is this ticket still a priority? Does ESLTR still intend to handle missing values as distinct from 0s?
Second (and perhaps more interestingly): given the current state of affairs, what is the best way to represent missing values for ESLTR? Intuitively, it seems problematic to simply treat them as 0s, because a 0 value for a binary feature means something different from saying that the feature is irrelevant for this observation.
Other tickets here have alluded to using some other sentinel value - for instance, the maximum float amount, or perhaps -1. But I'm curious: does the ESLTR team have any current recommendations for how to express missing features as distinct from negative features? Or, alternately, is the distinction not important? Should they be treated the same as negative features?
While trying to add support for XGBoost missing direction I realized that the way we handle missing values is not very clear (code- and doc-wise).
During logging we allow users to set missing_as_zero, which will emit zeros instead of nothing. After that it's up to the user to properly configure their training algorithm to handle these. E.g. XGBoost has support for them and will emit a model with an additional decision, "is missing?", besides the threshold check. Today the model parser for XGBoost completely ignores the missing branch. This basically assumes the features were logged with missing_as_zero. Concerning ranklib, its DataPoint natively supports missing values by using NaN. But in the plugin glue code we force all values to zeros in the reset() method.
Things we should fix regardless:
when logging we should fail if we emit NaN for a non-missing value (bogus feature/query) #136.
Evaluate:
adding a boolean missing(int featureIdx) method to the FeatureVector interface.
missing_as_zero.
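To illustrate why ignoring the missing branch matters, here is a small sketch of walking one tree in the JSON shape that XGBoost's model dump uses (nodes with "split", "split_condition", "yes", "no", "missing" and "children", leaves with "leaf"). The tree, the feature name, and the `eval_tree` helper are made up for illustration; this is not the plugin's parser.

```python
import math


def eval_tree(node, features):
    """Walk one tree, honouring the 'missing' branch for absent or NaN features."""
    while "leaf" not in node:
        value = features.get(node["split"], math.nan)
        if math.isnan(value):
            next_id = node["missing"]  # the "is missing?" decision
        elif value < node["split_condition"]:
            next_id = node["yes"]
        else:
            next_id = node["no"]
        node = next(child for child in node["children"] if child["nodeid"] == next_id)
    return node["leaf"]


# Hypothetical single-split tree whose missing direction points to the "no" child.
tree = {
    "nodeid": 0, "split": "f", "split_condition": 0.5,
    "yes": 1, "no": 2, "missing": 2,
    "children": [{"nodeid": 1, "leaf": -0.4}, {"nodeid": 2, "leaf": 0.7}],
}

print(eval_tree(tree, {"f": 0.0}))  # feature logged as zero: takes the "yes" branch
print(eval_tree(tree, {}))          # feature missing: takes the "missing" branch
```

In this example a feature forced to zero lands on a different leaf than a truly missing one, which is the mismatch that arises when the model was trained with NaN but scored with zeros.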