Groupwise Stratification in Sampling

exalate-issue-sync[bot] commented 1 year ago

Allow a user to specify a column that identifies groups. All rows that share a value in this column should land either entirely in sample or entirely out of sample, never split between the two.

This functionality would apply anywhere rows are sampled, such as probabilistic tree algorithms, split frame, and deep learning.
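A minimal sketch of the requested behavior in plain Python, to make the idea concrete. It is not H2O code: `groupwise_split`, `group_key`, and `train_fraction` are hypothetical names chosen for illustration. It also assumes the fraction is applied to the number of groups, so the row proportions will only be approximate when group sizes differ.

```python
import random

def groupwise_split(rows, group_key, train_fraction=0.75, seed=42):
    """Split rows so that every group lands wholly in one partition.

    Hypothetical helper for illustration; not part of any H2O API.
    """
    # Collect the distinct group values and shuffle them reproducibly.
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)

    # The fraction is applied to groups, not rows (an assumption of this sketch).
    n_train = int(round(train_fraction * len(groups)))
    train_groups = set(groups[:n_train])

    train = [row for row in rows if row[group_key] in train_groups]
    test = [row for row in rows if row[group_key] not in train_groups]
    return train, test

# Example: eight rows from four sites; no site may appear on both sides.
rows = [{"site": site, "y": i} for i, site in enumerate("AABBBCCD")]
train, test = groupwise_split(rows, "site")
train_sites = {row["site"] for row in train}
test_sites = {row["site"] for row in test}
```

Because membership is decided per group rather than per row, `train_sites` and `test_sites` cannot overlap, which is the property the request asks for.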

The rationale is that a user may know that the data set is autocorrelated within groups. If it is, letting the algorithms learn part of a group's targets and then predict the rest gives a misleadingly optimistic measure of accuracy. The user would be responsible for understanding whether this is the right course of action (it is often valid to take advantage of autocorrelation).

Competition examples, among many, include African Soil and BCI:
https://www.kaggle.com/c/afsis-soil-properties
https://www.kaggle.com/c/inria-bci-challenge

exalate-issue-sync[bot] commented 1 year ago

Mark Landry commented: Also related to this would be additional subsampling methods:

Contiguous/continuous: split the rows in their existing order. This can be done in R with subsetting, but not easily with the Flow API. An added benefit is that it is much easier to implement than groupwise splitting, and if a data set naturally keeps each group's rows together, a contiguous split will roughly accomplish the same goal (whereas a random split will not).

Stratified sampling: a near-random split in which the target variable keeps roughly the same proportions in every split.
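The two methods above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not H2O code: `contiguous_split`, `stratified_split`, `target_key`, and `train_fraction` are hypothetical names, and the stratified version assumes a categorical target with sortable labels.

```python
import random
from collections import defaultdict

def contiguous_split(rows, train_fraction=0.75):
    """Split rows in their existing order: first part in, remainder out."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def stratified_split(rows, target_key, train_fraction=0.75, seed=42):
    """Random split that keeps each target class in roughly the same proportion."""
    # Bucket the rows by target class.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    # Shuffle and cut each class separately so every class is split
    # at the same fraction.
    for label in sorted(by_class):
        members = list(by_class[label])
        rng.shuffle(members)
        cut = int(round(train_fraction * len(members)))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Contiguous: the first six of eight ordered rows go in sample.
ordered = list(range(8))
head, tail = contiguous_split(ordered)

# Stratified: 8 rows of class "a" and 4 of class "b", split 75/25 per class.
labelled = [{"y": "a"} for _ in range(8)] + [{"y": "b"} for _ in range(4)]
strat_train, strat_test = stratified_split(labelled, "y")
```

Note that neither function keeps groups together on its own. The contiguous split only approximates groupwise behavior when each group's rows are already adjacent in the data.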

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1848
Assignee: New H2O Bugs
Reporter: Mark Landry
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

h2oai / h2o-3

Groupwise Stratification in Sampling #14807