modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Add support for missing data? #19

Closed rustrust closed 2 years ago

rustrust commented 3 years ago

Unless I have overlooked something, it appears that tangram doesn't support missing data. Being able to handle missing data would be great :-)

isabella commented 3 years ago

@rustrust Tangram has support for missing values. Throughout the codebase, we refer to them as "invalid" values. An invalid value is currently defined as a value from the following hard-coded set:

/// These values are the default values that are considered invalid.
const DEFAULT_INVALID_VALUES: &[&str] = &[
  "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",
];
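As a rough illustration of how such a set might be applied when reading a CSV column, here is a minimal sketch. The `parse_number` helper is hypothetical and not taken from the Tangram codebase; only the constant is copied from the snippet above.

```rust
/// Copied from the snippet above.
const DEFAULT_INVALID_VALUES: &[&str] = &[
    "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",
];

/// Hypothetical helper (an assumption, not Tangram's API): parse one CSV
/// field as a number, returning `None` when the field is in the invalid set
/// or does not parse to a finite float.
fn parse_number(field: &str) -> Option<f32> {
    // Treat any member of the hard-coded set as a missing value.
    if DEFAULT_INVALID_VALUES.iter().any(|invalid| *invalid == field) {
        return None;
    }
    // Anything that fails to parse, or parses to inf/NaN, is also invalid here.
    field.parse::<f32>().ok().filter(|value| value.is_finite())
}
```

Under this sketch, an empty field between two commas is read as a missing value rather than as a parse error.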

Tangram's GBDT implementation has native support for invalid values.

Continuous branch splits, used for Number features in a tree node, have a field called invalid_values_direction. During the forward inference pass, when we reach such a node with a feature value that is "invalid", we use invalid_values_direction, either left or right, to decide which branch to follow.
https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/tree/lib.rs#L251

Discrete branch splits, used for Enum features, also support invalid values by assigning them to the 0th bin.
https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/tree/compute_binned_features.rs#L286
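The continuous-split behavior described above can be sketched roughly as follows. This is a simplified illustration, not Tangram's actual code: the `Node` and `Direction` types, the `predict` function, and the choice to encode an invalid value as NaN are all assumptions made for the example. Only the field name invalid_values_direction comes from the comment above.

```rust
/// Hypothetical direction type for this sketch.
#[derive(Clone, Copy)]
enum Direction {
    Left,
    Right,
}

/// Hypothetical, simplified tree node with continuous splits only.
enum Node {
    Leaf {
        value: f32,
    },
    Branch {
        feature_index: usize,
        split_value: f32,
        /// Which child to follow when the feature value is invalid.
        invalid_values_direction: Direction,
        left: Box<Node>,
        right: Box<Node>,
    },
}

/// Walk the tree for one example. Invalid values are assumed to be
/// encoded as NaN in `features` (an assumption of this sketch).
fn predict(node: &Node, features: &[f32]) -> f32 {
    match node {
        Node::Leaf { value } => *value,
        Node::Branch {
            feature_index,
            split_value,
            invalid_values_direction,
            left,
            right,
        } => {
            let x = features[*feature_index];
            let direction = if x.is_nan() {
                // Invalid value: use the direction stored on the node.
                *invalid_values_direction
            } else if x <= *split_value {
                Direction::Left
            } else {
                Direction::Right
            };
            match direction {
                Direction::Left => predict(left, features),
                Direction::Right => predict(right, features),
            }
        }
    }
}
```

The point of storing a direction per node is that a missing value never needs to be imputed before inference; each split decides for itself where missing values go.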

Tangram's linear implementation requires all feature values to be finite. We achieve this in the feature engineering step by using Normalized feature groups. https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/features/normalized.rs#L134 The invalid values in this case are mapped to the value "0.0". Because normalization centers each feature so that its mean is zero, mapping to 0.0 is the same as replacing the missing value with the feature's mean. In other words, Normalized feature groups perform mean imputation for missing values.
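To make the mean-imputation point concrete, here is a minimal sketch of a standard-score normalization that maps invalid inputs to 0.0. The `normalize` function is hypothetical, invalid values are assumed to arrive as NaN, and the use of population variance is an assumption; the linked Tangram source may differ in all of these details.

```rust
/// Hypothetical sketch: normalize a column to zero mean and unit variance,
/// computing statistics over the finite values only and mapping every
/// non-finite (invalid) value to 0.0.
fn normalize(values: &[f32]) -> Vec<f32> {
    let valid: Vec<f32> = values.iter().copied().filter(|v| v.is_finite()).collect();
    if valid.is_empty() {
        // No usable statistics: every output is the imputed value.
        return vec![0.0; values.len()];
    }
    let n = valid.len() as f32;
    let mean = valid.iter().sum::<f32>() / n;
    // Population variance (an assumption of this sketch).
    let variance = valid.iter().map(|&v| (v - mean).powi(2)).sum::<f32>() / n;
    let std = variance.sqrt();
    values
        .iter()
        .map(|&v| {
            if !v.is_finite() || std == 0.0 {
                // 0.0 is the mean of the normalized feature, so this is
                // mean imputation in the normalized space.
                0.0
            } else {
                (v - mean) / std
            }
        })
        .collect()
}
```

For example, with the inputs 1.0, NaN, and 3.0, the valid values have mean 2.0, so the NaN ends up exactly where a raw value of 2.0 would have landed.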

Are you observing a bug in Tangram's handling of invalid/missing values?

rustrust commented 3 years ago

I have a CSV with data having column types like this: integer, integer, string, float, float, integer

One of the rows has a "missing float"--so the raw CSV data looks like this: 1,2,blah,,3.14,4

Tangram dies on this input.

isabella commented 2 years ago

@rustrust Could you provide the error message or the complete CSV file to help me debug this? Tangram could be crashing for a number of reasons, not limited to lack of support for missing values.

nitsky commented 2 years ago

@rustrust please re-open this issue if you continue to have trouble.