Documentation: Update GLM FAQ and missing_values_handling parameter regarding unseen categorical values

exalate-issue-sync[bot] commented 1 year ago

From [~accountid:557058:2ceb7f2b-e7ca-465c-8e82-c046991100be]: Missing categorical levels are imputed with the mode, not a special missing level.

Unseen categorical levels are treated based on the missing values handling during training.

If your missing value handling was set to imputation with mean, the unseen levels are replaced by the most frequent level present in training (mode).

If your missing value treatment was Skip, the variable is ignored for the given observation.

If you ran with use_all_factor_levels=False, that essentially means they are replaced by the reference level.
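
The rules above can be illustrated with a small, self-contained sketch. This is not H2O's implementation: `fit_levels` and `encode` are hypothetical helpers, and representing "the variable is ignored" under Skip as an all-zero one-hot vector is an assumption made only for illustration.

```python
from collections import Counter


def fit_levels(training_values):
    """Return the sorted training levels and the most frequent one (the mode)."""
    counts = Counter(v for v in training_values if v is not None)
    levels = sorted(counts)
    # Ties resolve to the first level in sorted order (an arbitrary choice here).
    mode = max(levels, key=lambda lv: counts[lv])
    return levels, mode


def encode(value, levels, mode, handling="MeanImputation"):
    """One-hot encode a categorical value, resolving missing or unseen levels."""
    if value is None or value not in levels:
        if handling == "MeanImputation":
            value = mode  # replace with the most frequent training level
        elif handling == "Skip":
            # Assumption: an all-zero vector stands in for "variable ignored".
            return [0] * len(levels)
        else:
            raise ValueError(f"unknown handling: {handling}")
    return [1 if lv == value else 0 for lv in levels]


levels, mode = fit_levels(["red", "blue", "red", "green"])
unseen_imputed = encode("purple", levels, mode)
unseen_skipped = encode("purple", levels, mode, handling="Skip")
```

Here "purple" never appears in training, so under mean imputation it is encoded exactly as "red" (the mode) would be.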

exalate-issue-sync[bot] commented 1 year ago

Angela Bartz commented: Updated after review

GLM FAQ now states the following regarding missing values:

How does the algorithm handle missing values during training?

Depending on the selected missing value handling policy, missing values are either imputed with the mean or the whole row is skipped. The default behavior is Mean Imputation. Note that unseen categorical levels are replaced by the most frequent level present in training (mode). Optionally, GLM can skip all rows with any missing values.

How does the algorithm handle missing values during testing?

Same as during training. If the missing value handling is set to Skip and we are generating predictions, skipped rows will have an NA (missing) prediction.

What happens if the response has missing values?

The rows with missing responses are ignored during model training and validation.

What happens during prediction if the new sample has categorical levels not seen in training?

The value will either be filled with 0 or be replaced by the most frequent level present in training (if missing_values_handling was set to MeanImputation).

How are unseen categorical values treated during scoring?

Unseen categorical levels are treated based on the missing values handling during training. If your missing value handling was set to Mean Imputation, the unseen levels are replaced by the most frequent level present in training (mode). If your missing value treatment was Skip, the variable is ignored for the given observation.
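
As a rough illustration of the testing-time behavior described in the FAQ, the sketch below scores rows with a plain linear predictor. It is a toy model, not the GLM scoring code: `predict_rows`, its arguments, and the use of `None` to represent NA are all assumptions for the example.

```python
def predict_rows(rows, coefficients, intercept, handling="Skip", means=None):
    """Score numeric rows; None marks a missing value in a row."""
    predictions = []
    for row in rows:
        if any(x is None for x in row):
            if handling == "Skip":
                # Skipped rows get an NA prediction (None in this sketch).
                predictions.append(None)
                continue
            if means is None:
                raise ValueError("training means are required for imputation")
            # Mean imputation: fill each missing value with its training mean.
            row = [means[i] if x is None else x for i, x in enumerate(row)]
        predictions.append(intercept + sum(c * x for c, x in zip(coefficients, row)))
    return predictions


rows = [[1.0, 2.0], [None, 3.0]]
skipped = predict_rows(rows, [2.0, 1.0], 0.5)
imputed = predict_rows(rows, [2.0, 1.0], 0.5,
                       handling="MeanImputation", means=[2.0, 0.0])
```

With Skip, the second row yields no prediction; with mean imputation, the missing value is filled before scoring, so every row gets a number.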

exalate-issue-sync[bot] commented 1 year ago

Angela Bartz commented: Pull request 1038 submitted.

exalate-issue-sync[bot] commented 1 year ago

Angela Bartz commented: Also updated the missing_values_handling parameter description as below:

"... Note that in Deep Learning, unseen categorical variables are imputed by adding an extra “missing” level. In GLM, unseen categorical levels are replaced by the most frequent level present in training (mode). Optionally, either algorithm can skip all rows with any missing values."
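
The difference between the two strategies in that description can be sketched as two encoders. Both functions are hypothetical and only mirror the wording of the parameter description; they are not taken from the Deep Learning or GLM code.

```python
def encode_with_missing_level(value, levels):
    """Deep Learning-style sketch: one extra trailing slot for missing/unseen."""
    vector = [0] * (len(levels) + 1)
    index = levels.index(value) if value in levels else len(levels)
    vector[index] = 1
    return vector


def encode_with_mode(value, levels, mode):
    """GLM-style sketch: unseen values fall back to the most frequent level."""
    vector = [0] * len(levels)
    vector[levels.index(value if value in levels else mode)] = 1
    return vector


levels = ["blue", "green", "red"]  # levels seen in training; "red" is the mode
dl_unseen = encode_with_missing_level("purple", levels)
glm_unseen = encode_with_mode("purple", levels, "red")
```

The extra-level encoding keeps unseen values distinguishable from every training level, while the mode-based encoding makes them indistinguishable from the most frequent one.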

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4287 Assignee: Angela Bartz Reporter: Angela Bartz State: Resolved Fix Version: 3.10.4.4 Attachments: N/A Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/1038

h2oai / h2o-3

Documentation: Update GLM FAQ and missing_values_handling parameter regarding unseen categorical values #11176