rustrust closed this issue 2 years ago
@rustrust Tangram has support for missing values. Throughout the codebase, we refer to them as "invalid" values. An invalid value is currently defined as a value from the following hard-coded set:
/// These values are the default values that are considered invalid.
const DEFAULT_INVALID_VALUES: &[&str] = &[
    "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",
];
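To make the definition concrete, here is a minimal sketch of how a raw CSV field could be checked against that set. The helpers `is_invalid` and `parse_number` are hypothetical names made up for this illustration and are not Tangram's actual API; only the constant is taken from the snippet above.

```rust
/// Copied from the snippet above.
const DEFAULT_INVALID_VALUES: &[&str] = &[
    "", "?", "null", "NULL", "n/a", "N/A", "nan", "-nan", "NaN", "-NaN",
];

/// Hypothetical helper (not part of Tangram): returns true when a raw
/// field matches one of the default invalid values exactly.
fn is_invalid(field: &str) -> bool {
    DEFAULT_INVALID_VALUES.contains(&field)
}

/// Hypothetical helper (not part of Tangram): parses a field of a number
/// column, mapping invalid or unparseable fields to `None`.
fn parse_number(field: &str) -> Option<f32> {
    if is_invalid(field) {
        return None;
    }
    field.parse::<f32>().ok()
}
```

Under this sketch, a row such as `1,2,blah,,3.14,4` splits into six fields, and only the fourth (the empty string) is classified as invalid.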
Tangram's GBDT implementation has native support for invalid values. Continuous branch splits, which are used for Number features, have a field called invalid_values_direction. During inference, when we reach such a node with an invalid feature value, we follow the branch given by invalid_values_direction, either left or right: https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/tree/lib.rs#L251
Discrete branch splits, which are used for Enum features, also support invalid values by assigning them to the 0th bin: https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/tree/compute_binned_features.rs#L286
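The routing described above can be sketched roughly as follows. This is a simplified illustration, not the code at the linked lines: the types `SplitDirection`, `ContinuousSplit`, and `DiscreteSplit`, the `split_value` and `directions` fields, the "less than or equal goes left" rule, and the use of `Option` to represent a missing value are all assumptions. Only the name `invalid_values_direction` and the "invalid values go to bin 0" convention come from the description.

```rust
/// Hypothetical, simplified types; Tangram's real tree structures differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SplitDirection {
    Left,
    Right,
}

struct ContinuousSplit {
    /// Assumption of this sketch: values <= split_value go left.
    split_value: f32,
    /// The branch that invalid (missing) values follow.
    invalid_values_direction: SplitDirection,
}

impl ContinuousSplit {
    /// `None` and NaN are both treated as invalid in this sketch.
    fn direction(&self, value: Option<f32>) -> SplitDirection {
        match value {
            Some(v) if !v.is_nan() => {
                if v <= self.split_value {
                    SplitDirection::Left
                } else {
                    SplitDirection::Right
                }
            }
            _ => self.invalid_values_direction,
        }
    }
}

struct DiscreteSplit {
    /// directions[bin] is the branch for that bin; bin 0 holds invalid values.
    directions: Vec<SplitDirection>,
}

impl DiscreteSplit {
    /// `variant` is the index of a valid enum variant, or `None` if invalid.
    /// Valid variant i is stored in bin i + 1, so bin 0 stays reserved.
    fn direction(&self, variant: Option<usize>) -> SplitDirection {
        let bin = variant.map_or(0, |i| i + 1);
        self.directions[bin]
    }
}
```

The point of the design is that a missing value never needs to be imputed before inference: each node carries its own decision about where missing values go.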
Tangram's linear implementation requires all feature values to be finite. We achieve this in the feature engineering step by using Normalized feature groups: https://github.com/tangramxyz/tangram/blob/a188d7befa5b5612e45d08ba9be72638a617de87/crates/features/normalized.rs#L134
Invalid values in this case are mapped to 0.0. Because valid values are normalized to have a mean of zero, this means Normalized feature groups perform mean imputation for missing values.
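To show why mapping to 0.0 amounts to mean imputation, here is a small sketch of a normalized feature. It assumes normalization means subtracting the mean and dividing by the (population) standard deviation of the valid values. The function `normalize` and the use of `Option<f32>` for missing values are hypothetical and not taken from the linked file.

```rust
/// Hypothetical sketch (not Tangram's code): standardize the valid values
/// of a column and map invalid values to 0.0.
fn normalize(values: &[Option<f32>]) -> Vec<f32> {
    // Collect only the valid (present) values.
    let valid: Vec<f32> = values.iter().flatten().copied().collect();
    let n = valid.len() as f32;
    let mean = if valid.is_empty() {
        0.0
    } else {
        valid.iter().sum::<f32>() / n
    };
    let variance = if valid.is_empty() {
        0.0
    } else {
        valid.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n
    };
    let std = variance.sqrt();
    values
        .iter()
        .map(|value| match value {
            // Valid value in a non-constant column: standardize it.
            Some(v) if std > 0.0 => (v - mean) / std,
            // Invalid value, or a constant column: use 0.0, which is the
            // mean of the standardized valid values.
            _ => 0.0,
        })
        .collect()
}
```

For the column `[Some(1.0), None, Some(3.0)]`, the valid values have mean 2 and standard deviation 1, so the output is `[-1.0, 0.0, 1.0]`: the missing entry lands exactly on the mean of the normalized column.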
Are you observing a bug in Tangram's handling of invalid/missing values?
I have a CSV with data having column types like this: integer, integer, string, float, float, integer
One of the rows has a "missing float", so the raw CSV data looks like this: 1,2,blah,,3.14,4
And Tangram dies on this input.
@rustrust Could you provide the error message or the complete CSV file to help me debug this? Tangram could be crashing for a number of reasons, not limited to lack of support for missing values.
@rustrust please re-open this issue if you continue to have trouble.
Unless I have overlooked something, it appears that tangram doesn't support missing data. Being able to handle missing data would be great :-)