Handling Categorical Data / Validation: switching from np.ndarray -> pd.DataFrame

mitre / menelaus

Online and batch-based concept and data drift detection algorithms to monitor and maintain ML performance.

https://menelaus.readthedocs.io/en/latest/

Apache License 2.0


Handling Categorical Data / Validation: switching from np.ndarray -> pd.DataFrame #120

Open indialindsay opened 2 years ago

indialindsay commented 2 years ago

Handling categorical data in HDDDM: ask the user to set the dtype of categorical columns to category, then label encode them.

For DetectA, if the input dtype is category, apply get_dummies().

Switch from np.ndarray to pd.DataFrame for the validation steps. Why?

A DataFrame lets us store the mapping between categorical variables and their encodings, so we can help the user identify which class or which categories are experiencing drift.
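A minimal sketch of the mapping idea, assuming pandas' category dtype is used for the label encoding. The column names and data are hypothetical and this is not menelaus code; it only shows that the code-to-category mapping can be kept alongside the numeric array and used to decode later.

```python
import pandas as pd

# Hypothetical input with one categorical and one numeric feature.
df = pd.DataFrame(
    {
        "animal": ["cat", "dog", "marmoset", "dog", "cat"],
        "weight": [4.0, 20.5, 0.3, 18.2, 5.1],
    }
)

# The user marks the column as categorical.
df["animal"] = df["animal"].astype("category")

# Keep the code -> category mapping before discarding the labels.
mapping = dict(enumerate(df["animal"].cat.categories))

# Label encode: replace each category with its integer code.
encoded = df.assign(animal=df["animal"].cat.codes)

# A purely numeric array, as a detector working on np.ndarray would expect.
X = encoded.to_numpy()

# Decode the first column back to category names using the stored mapping.
decoded = [mapping[int(code)] for code in X[:, 0]]
```

With the mapping retained, a drifting code such as 2 can be reported to the user as "marmoset" rather than as a bare integer.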

tms-bananaquit commented 2 years ago

If we have an internal encoder for each detector (or even just labels), then we might want to set up examples so that they decode, e.g. histograms which swap out "(0,1], (2,3], ..." for "cat, dog, marmoset".
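As a rough illustration of that decoding step, assuming the detector only sees integer codes and the category labels are stored separately (both the codes and the labels below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stored labels, in code order, and a batch of encoded values.
categories = ["cat", "dog", "marmoset"]
codes = np.array([0, 1, 1, 2, 0, 1])

# Histogram over the integer codes, one bin per category.
counts = np.bincount(codes, minlength=len(categories))

# Swap the numeric bin labels for the category names before plotting or printing.
labelled = pd.Series(counts, index=categories)
```

Calling `labelled.plot.bar()` would then show "cat, dog, marmoset" on the axis instead of numeric bins.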

indialindsay commented 2 years ago

I'm leaning towards requiring the user to convert categorical features with get_dummies as a preprocessing step, before using our detectors (thinking this through in the context of DetectA). It seems risky to automatically convert categorical features to dummies. It would also be consistent with other ML libraries, which require the user to preprocess categorical data before use.

I can add a note about label encoding to the HDDDM example / documentation, and a note and example on using get_dummies to DetectA.
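A sketch of what such a preprocessing example might look like on the user's side, assuming pandas is used. The data and column names are hypothetical, and the detector call itself is left out.

```python
import pandas as pd

# Hypothetical raw data with one categorical feature.
df = pd.DataFrame(
    {
        "color": ["red", "green", "red", "blue"],
        "size": [1.0, 2.5, 0.7, 3.1],
    }
)

# One-hot encode the categorical column; numeric columns pass through unchanged.
dummies = pd.get_dummies(df, columns=["color"], dtype=float)

# Numeric array to hand to a detector.
X = dummies.to_numpy()
```

The column names of `dummies` (for example `color_blue`) still identify which category each encoded feature refers to.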

Thoughts? @tms-bananaquit

indialindsay commented 2 years ago

Saving DetectA's handling of categorical features for later.

Problem: Because we use PCA to whiten the matrix, we must use one-hot encoding to handle categorical variables. When computing the covariance matrix, this results in the correlations of the one-hot encoded variables being the same for several features (for example, a few features will all have identical rows in the covariance matrix because the correlation between them and other one-hot encoded features is either 0 or 1). This causes the determinant of the covariance matrix to be 0, so we cannot compute the inverse and calculate the T² statistic.

Potential solutions could consider another form of encoding the categorical variables, e.g. encode using one-hot for PCA and then convert back to label encoding?
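The singularity can be reproduced outside DetectA with a small sketch. It relies on one well-known cause, which may or may not be the exact mechanism described above: the full set of one-hot columns for a feature always sums to 1, so the columns are linearly dependent and the covariance matrix loses rank. The data is synthetic, and `drop_first=True` is shown only as one possible alternative encoding, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one numeric feature and one three-level categorical feature.
df = pd.DataFrame(
    {
        "x": rng.normal(size=200),
        "color": rng.choice(["red", "green", "blue"], size=200),
    }
)

# Full one-hot encoding: the three dummy columns sum to 1 in every row.
full = pd.get_dummies(df, columns=["color"], dtype=float)
cov_full = np.cov(full.to_numpy(), rowvar=False)
rank_full = np.linalg.matrix_rank(cov_full)

# Assumption for illustration: dropping one level removes the linear dependence.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=float)
cov_reduced = np.cov(reduced.to_numpy(), rowvar=False)
rank_reduced = np.linalg.matrix_rank(cov_reduced)

print(cov_full.shape, rank_full)        # 4x4 matrix that is rank deficient
print(cov_reduced.shape, rank_reduced)  # 3x3 matrix with full rank
```

A rank-deficient covariance matrix has determinant 0 and no inverse, which is why the T² statistic cannot be computed from the full encoding.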

tms-bananaquit commented 2 years ago

After more discussion, we'll likely leave proper encoding to the user, with examples. One can imagine cases where the incoming data is already encoded properly, as it comes out of a query or similar, so asking the user to back-convert to e.g. a dataframe could duplicate work for them and add the burden of more code for us. Will think a bit more about this and see whether there's good reason to make other tweaks to the validation.