Open indialindsay opened 2 years ago
If we have an internal encoder for each detector (or even just labels), then we might want to set up examples so that they decode, e.g. histograms which swap out "(0,1], (2,3], ..." for "cat, dog, marmoset".
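A rough sketch of what that decoding could look like, assuming the feature was label-encoded with pandas category codes. The toy data and variable names here are made up for illustration and are not part of any detector's API:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature; category names borrowed from the example above.
animals = pd.Series(
    ["cat", "dog", "marmoset", "dog", "cat", "dog"], dtype="category"
)

# Label-encode: each category becomes an integer code (0, 1, 2, ...).
codes = animals.cat.codes.to_numpy().astype(np.int64)
categories = list(animals.cat.categories)

# Histogram over the integer codes, as a detector working on encoded data might see it.
counts = np.bincount(codes, minlength=len(categories))

# Decode: label each count with its category name instead of a code or bin edge.
decoded = pd.Series(counts, index=categories)
print(decoded)
```

Keeping the `categories` list next to the encoded data is what makes the swap back to readable labels possible.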
I'm leaning towards requiring the user to one-hot encode categorical features with get_dummies as a preprocessing step before using our detectors (thinking this through in the context of DetectA). Automatically converting categorical features to dummies seems risky, and requiring the user to preprocess categorical data beforehand would be consistent with other ML libraries.
I can add a note to the HDDDM example / documentation about using label encoding, and a note and example to DetectA on using get_dummies.
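Something like the following could serve as the user-facing preprocessing example. This is only a sketch: the toy DataFrame is invented, and pandas category codes stand in for whichever label encoder the docs end up recommending.

```python
import pandas as pd

# Invented toy data with one numeric and one categorical feature.
df = pd.DataFrame(
    {
        "weight": [4.1, 30.2, 0.4, 25.0],
        "animal": ["cat", "dog", "marmoset", "dog"],
    }
)

# One-hot encoding (the get_dummies route discussed for DetectA).
one_hot = pd.get_dummies(df, columns=["animal"], dtype=float)

# Label encoding (the route discussed for HDDDM), here via pandas category codes.
label_encoded = df.copy()
label_encoded["animal"] = label_encoded["animal"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(label_encoded["animal"].tolist())
```

The encoded frame (or its `.to_numpy()` array) is then what the user would pass to the detector.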
Thoughts? @tms-bananaquit
Saving DetectA's handling of categorical features for later.
Problem: Because we use PCA to whiten the matrix, we must use one-hot encoding to handle categorical variables. When computing the covariance matrix, this results in the correlations of the one-hot encoded variables being the same for several features (for example, a few features will all have identical rows in the covariance matrix because the correlation between them and other one-hot encoded features is either 0 or 1). This causes the determinant of the covariance matrix to be 0, so we cannot compute the inverse and calculate the T² statistic.
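A small reproduction of the singularity, as a sketch on invented data. One way to see it: the full set of one-hot columns for a single feature always sums to 1, so the columns are linearly dependent and the covariance matrix loses a rank. Dropping one level (`drop_first=True`) is shown only as one possible workaround, not as the approach settled on here.

```python
import numpy as np
import pandas as pd

# Invented categorical feature in which every category appears several times.
animal = pd.Series(["cat", "dog", "marmoset"] * 4 + ["dog", "cat"])

# Full one-hot encoding: three indicator columns that sum to 1 in every row.
full = pd.get_dummies(animal, dtype=float)
cov_full = np.cov(full.to_numpy(), rowvar=False)
rank_full = np.linalg.matrix_rank(cov_full)  # lower than the number of columns
det_full = np.linalg.det(cov_full)  # zero up to floating-point error

# Dropping one level removes the exact linear dependence between the columns.
reduced = pd.get_dummies(animal, drop_first=True, dtype=float)
cov_reduced = np.cov(reduced.to_numpy(), rowvar=False)
rank_reduced = np.linalg.matrix_rank(cov_reduced)

print(cov_full.shape, rank_full, det_full)
print(cov_reduced.shape, rank_reduced)
```

Whether a reduced encoding interacts well with the PCA whitening step would still need to be checked against DetectA itself.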
Potential solutions could consider another form of encoding for the categorical variables. Perhaps encode using one-hot for PCA and then convert back to label encoding?
After more discussion, we'll likely leave proper encoding to the user, with examples. One can imagine cases where the incoming data is already encoded properly, as it comes out of a query, or similar, so asking the user to potentially back-convert to e.g. a dataframe is potentially duplicating work for them and adding the burden of more code for us. Will think a bit more about this and see whether there's good reason to make other tweaks to the validation.
Handling categorical data in HDDDM: ask the user to set the input dtype to category, then label encode
For DetectA: if the input dtype is category, then use get_dummies()
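A possible shape for that dtype-driven handling, as a sketch only. `encode_categoricals` and its `how` argument are hypothetical names, not existing functions in the library, and pandas category codes are assumed as the label encoding.

```python
import pandas as pd


def encode_categoricals(df, how):
    """Hypothetical helper: encode only the columns whose dtype is `category`."""
    cat_cols = list(df.select_dtypes(include="category").columns)
    if how not in ("label", "one_hot"):
        raise ValueError("how must be 'label' or 'one_hot'")
    if not cat_cols:
        # Nothing is marked as categorical, so return the data unchanged.
        return df.copy()
    if how == "label":
        # HDDDM-style: replace each category with its integer code.
        out = df.copy()
        for col in cat_cols:
            out[col] = out[col].cat.codes
        return out
    # DetectA-style: expand each categorical column into indicator columns.
    return pd.get_dummies(df, columns=cat_cols, dtype=float)


# Invented toy input where the user has marked `animal` as categorical.
toy = pd.DataFrame(
    {
        "weight": [4.1, 30.2, 0.4],
        "animal": pd.Series(["cat", "dog", "marmoset"], dtype="category"),
    }
)
for_hdddm = encode_categoricals(toy, how="label")
for_detecta = encode_categoricals(toy, how="one_hot")
```

Relying on the `category` dtype means a plain integer column is never encoded by accident, which is one argument for validating on a DataFrame rather than a bare array.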
switch to pd.DataFrame from np.ndarray for validation steps. Why?