modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Add support for missing data? #19

Closed rustrust closed 2 years ago

rustrust commented 3 years ago

Unless I have overlooked something, it appears that tangram doesn't support missing data. Being able to handle missing data would be great :-)

isabella commented 3 years ago

@rustrust Tangram has support for missing values. Throughout the codebase, we refer to them as "invalid" values. An invalid value is currently defined as a value from the following hard-coded set:

/// These values are the default values that are considered invalid.
const DEFAULT_INVALID_VALUES: &[&str] = &[
  "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",

Tangram's GBDT implementation has native support for invalid values. Continuous branch splits, used for Number features in a tree node, have a field called invalid_values_direction. In performing the forward inference pass on the GBDT, when we reach this node with a feature value that is an "invalid" value, we will use the invalid_values_direction, either left or right, to decide which branch to follow. Discrete branch splits, used for Enum features, also support invalid values by assigning invalid values to the 0th bin.
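
The routing described above can be illustrated with a minimal sketch. This is not Tangram's actual code: only the field name invalid_values_direction comes from the description, while SplitDirection, ContinuousSplit, parse_number, branch, and enum_bin are hypothetical names, and representing an invalid number as NaN is an assumption made for illustration. The marker list is just the portion visible in the snippet quoted earlier.

```rust
/// Markers treated as invalid: only the entries visible in the quoted snippet.
const INVALID_VALUES: &[&str] = &[
    "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",
];

#[derive(Clone, Copy, Debug, PartialEq)]
enum SplitDirection {
    Left,
    Right,
}

/// Hypothetical continuous split used for Number features.
struct ContinuousSplit {
    split_value: f32,
    /// Branch taken when the feature value is invalid.
    invalid_values_direction: SplitDirection,
}

/// Parse a raw field. Invalid markers and unparseable text become NaN
/// (an assumption of this sketch, not a confirmed detail of Tangram).
fn parse_number(raw: &str) -> f32 {
    if INVALID_VALUES.contains(&raw) {
        return f32::NAN;
    }
    raw.parse().unwrap_or(f32::NAN)
}

/// Choose a branch at a continuous split during the forward pass.
fn branch(split: &ContinuousSplit, value: f32) -> SplitDirection {
    if value.is_nan() {
        split.invalid_values_direction
    } else if value <= split.split_value {
        SplitDirection::Left
    } else {
        SplitDirection::Right
    }
}

/// Map an Enum feature value to a bin. Bin 0 is reserved for invalid values,
/// so known variants start at bin 1.
fn enum_bin(variants: &[&str], raw: &str) -> usize {
    variants.iter().position(|v| *v == raw).map_or(0, |i| i + 1)
}
```

With a split at 2.5 and invalid_values_direction set to Right, an empty field would follow the right branch, while "1.0" would go left. An Enum value of "?" would land in bin 0.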

Tangram's linear implementation requires all feature values to be finite. We achieve this in the feature engineering step by using Normalized feature groups, which map invalid values to "0.0". Because a normalized feature has zero mean, this amounts to mean imputation for missing values.
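
To see why mapping to 0.0 is mean imputation, here is a small sketch of the idea. It is not Tangram's implementation: the function name normalize is hypothetical, invalid values are assumed to arrive as NaN, and the sketch uses the population standard deviation, which may differ from what Tangram computes.

```rust
/// Normalize a column to zero mean and unit variance, using only the valid
/// (finite) entries to compute the statistics. Invalid entries are mapped to
/// 0.0, which is the mean of the normalized column.
fn normalize(values: &[f32]) -> Vec<f32> {
    let valid: Vec<f32> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if valid.is_empty() {
        // No valid entries: every value is imputed.
        return vec![0.0; values.len()];
    }
    let n = valid.len() as f32;
    let mean = valid.iter().sum::<f32>() / n;
    let variance = valid.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let std = variance.sqrt();
    values
        .iter()
        .map(|&v| {
            if !v.is_finite() || std == 0.0 {
                // Invalid value, or a constant column: fall back to the mean.
                0.0
            } else {
                (v - mean) / std
            }
        })
        .collect()
}
```

For a column of 1.0, NaN, 3.0, the valid entries have mean 2.0 and standard deviation 1.0, so the missing entry ends up exactly where a raw value of 2.0 would have.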

Are you observing a bug in Tangram's handling of invalid/missing values?

rustrust commented 3 years ago

I have a CSV with data having column types like this: integer, integer, string, float, float, integer

One of the rows has a missing float, so the raw CSV data looks like this: 1,2,blah,,3.14,4

Tangram dies on this input.

isabella commented 2 years ago

@rustrust Could you provide the error message or the complete CSV file to help me debug this? Tangram could be crashing for a number of reasons, not limited to lack of support for missing values.

nitsky commented 2 years ago

@rustrust please re-open this issue if you continue to have trouble.