Closed exalate-issue-sync[bot] closed 1 year ago
Angela Bartz commented: Updated after review
GLM FAQ now states the following regarding missing values:
How does the algorithm handle missing values during training?
Depending on the selected missing value handling policy, missing values are either imputed with the mean or the whole row is skipped. The default behavior is Mean Imputation. Note that unseen categorical levels are replaced by the most frequent level present in training (the mode). Optionally, GLM can skip all rows with any missing values.
How does the algorithm handle missing values during testing?
Same as during training. If the missing value handling is set to Skip and we are generating predictions, skipped rows will have an NA (missing) prediction.
What happens if the response has missing values?
The rows with missing responses are ignored during model training and validation.
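The three answers above can be illustrated with a minimal plain-Python sketch. It is not H2O's implementation: the helper names (`column_means`, `training_rows`, `predict`), the use of `None` to mark a missing value, and the choice to compute means over all supplied rows are assumptions made for illustration only.

```python
from statistics import mean

def column_means(rows, predictors):
    """Mean of each predictor over its non-missing values (None marks missing)."""
    return {c: mean(r[c] for r in rows if r[c] is not None) for c in predictors}

def training_rows(rows, predictors, response, policy="MeanImputation"):
    """Return the rows that would be used for training under the given policy."""
    means = column_means(rows, predictors)
    kept = []
    for r in rows:
        if r[response] is None:
            continue  # rows with a missing response are ignored under either policy
        if any(r[c] is None for c in predictors):
            if policy == "Skip":
                continue  # Skip: drop the whole row
            # MeanImputation: fill each missing predictor with its column mean
            r = {**r, **{c: means[c] for c in predictors if r[c] is None}}
        kept.append(r)
    return kept, means

def predict(row, predictors, means, score, policy="MeanImputation"):
    """Score one row; under Skip, a row with missing predictors gets None (NA)."""
    if any(row[c] is None for c in predictors):
        if policy == "Skip":
            return None
        row = {**row, **{c: means[c] for c in predictors if row[c] is None}}
    return score(row)

# Toy data: one row has a missing predictor, one has a missing response.
rows = [{"x": 1.0, "y": 1.0}, {"x": None, "y": 2.0}, {"x": 3.0, "y": None}]
imputed, means = training_rows(rows, ["x"], "y", policy="MeanImputation")
skipped, _ = training_rows(rows, ["x"], "y", policy="Skip")
```

Here `imputed` keeps two rows (the missing `x` is filled with the column mean), while `skipped` keeps only the complete row; the row with a missing response is dropped in both cases.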
What happens during prediction if the new sample has categorical levels not seen in training?
The value is either filled with 0 or replaced by the most frequent level present in training (if missing_values_handling was set to MeanImputation).
How are unseen categorical values treated during scoring?
Unseen categorical levels are treated based on the missing values handling during training. If your missing value handling was set to Mean Imputation, the unseen levels are replaced by the most frequent level present in training (the mode). If your missing value treatment was Skip, the variable is ignored for the given observation.
Angela Bartz commented: Pull request 1038 submitted.
Angela Bartz commented: Also updated the missing_values_handling parameter description as below:
"... Note that in Deep Learning, unseen categorical variables are imputed by adding an extra “missing” level. In GLM, unseen categorical levels are replaced by the most frequent level present in training (mode). Optionally, either algorithm can skip all rows with any missing values."
JIRA Issue Migration Info
Jira Issue: PUBDEV-4287 Assignee: Angela Bartz Reporter: Angela Bartz State: Resolved Fix Version: 3.10.4.4 Attachments: N/A Development PRs: Available
Linked PRs from JIRA
From [~accountid:557058:2ceb7f2b-e7ca-465c-8e82-c046991100be]: Missing categorical levels are imputed with the mode, not a special missing level.
Unseen categorical levels are treated based on the missing values handling during training.
If your missing value handling was set to imputation with mean, the unseen levels are replaced by the most frequent level present in training (the mode).
If your missing value treatment was Skip, the variable is ignored for the given observation.
If you ran with use_all_factor_levels=False, that essentially means they are replaced by the reference level.
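To see why ignoring the variable is equivalent to using the reference level, here is a small plain-Python sketch of indicator (one-hot) encoding. It is an illustration under stated assumptions, not H2O's code: the helpers `one_hot` and `encode` are hypothetical, and the reference level is assumed to be the first training level, which is dropped when `use_all_factor_levels` is false.

```python
def one_hot(level, train_levels, use_all_factor_levels=False):
    """Indicator columns for one categorical value; an unseen level yields all zeros."""
    # Assumption: the first training level is the reference level and gets no column.
    kept = train_levels if use_all_factor_levels else train_levels[1:]
    return [1 if level == known else 0 for known in kept]

def encode(level, train_levels, mode, policy="MeanImputation",
           use_all_factor_levels=False):
    """Encode a scoring-time value according to the training-time policy."""
    if level not in train_levels and policy == "MeanImputation":
        level = mode  # unseen level replaced by the most frequent training level
    # Under Skip the unseen level is left alone, so every indicator is 0 and
    # the variable contributes nothing to the linear predictor.
    return one_hot(level, train_levels, use_all_factor_levels)

levels = ["a", "b", "c"]  # "a" plays the role of the reference level
mode = "b"                # most frequent level in the (hypothetical) training data

imputed = encode("z", levels, mode, policy="MeanImputation")
ignored = encode("z", levels, mode, policy="Skip")
reference = encode("a", levels, mode, policy="Skip")
```

With the reference level dropped, `ignored` and `reference` are the same all-zero vector, which is the equivalence described above; with `use_all_factor_levels=True` the unseen level would instead produce a row of zeros that matches no training level.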