Ultimately, we pass XGBoost a list of allowed features. This list is normally the cross-product of three sets: window sizes, given as a list that Optuna samples from a set of options; aggregations, likewise sampled by Optuna from a set of options; and codes, determined by manual entry and frequency-based constraints. We should expand the searchable and expressible space to include additional kinds of options and search paradigms, including:
[ ] Allowing Optuna to sample different window sizes for inclusion in the final set independently of one another from a set of total possibilities.
[ ] Allowing Optuna to sample different aggregations for inclusion in the final set independently of one another from a set of total possibilities.
[ ] Allowing Optuna to sample different codes for inclusion in the final set independently of one another from a set of total possibilities.
[ ] Allowing Optuna to sample different window sizes, aggregations, or codes in a mutually dependent manner -- e.g., for this code, use this aggregation and window, etc.
[ ] Allowing the system to use not only global code frequencies to decide which codes to include, but also information like a code's task-cohort-specific frequency, its correlation with the label, etc.
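
To make the first four items concrete, here is a minimal sketch of what the two sampling styles could look like. Everything in it is an assumption for illustration: the function names, the `code/window/agg` feature naming, and the parameter names are hypothetical and not taken from our codebase. The only Optuna call it relies on is `trial.suggest_categorical(name, choices)`; `RandomTrial` is a small stand-in so the sketch runs without Optuna installed.

```python
import random
from itertools import product


def sample_independent_features(trial, windows, aggregations, codes):
    """Independent sampling: one include/exclude flag per window, aggregation, and code.

    `trial` is any object exposing Optuna's `suggest_categorical(name, choices)`.
    The returned feature names use a hypothetical "code/window/agg" format.
    """
    def pick(kind, options):
        # Each option gets its own boolean parameter, so options are chosen
        # independently of one another.
        return [o for o in options
                if trial.suggest_categorical(f"use_{kind}/{o}", [True, False])]

    kept_windows = pick("window", windows)
    kept_aggs = pick("agg", aggregations)
    kept_codes = pick("code", codes)
    # The allowed feature list is still a cross-product, but of the sampled subsets.
    return [f"{c}/{w}/{a}" for c, w, a in product(kept_codes, kept_windows, kept_aggs)]


def sample_per_code_features(trial, windows, aggregations, codes):
    """Dependent sampling: each included code gets its own window and aggregation."""
    features = []
    for code in codes:
        if not trial.suggest_categorical(f"use_code/{code}", [True, False]):
            continue
        # These parameters are only sampled when the code is included
        # (a conditional, define-by-run search space).
        window = trial.suggest_categorical(f"window/{code}", windows)
        agg = trial.suggest_categorical(f"agg/{code}", aggregations)
        features.append(f"{code}/{window}/{agg}")
    return features


class RandomTrial:
    """Stand-in for an Optuna trial; samples uniformly and records parameters."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_categorical(self, name, choices):
        if name not in self.params:
            self.params[name] = self._rng.choice(list(choices))
        return self.params[name]


if __name__ == "__main__":
    trial = RandomTrial(seed=0)
    print(sample_independent_features(trial, ["1d", "7d"], ["sum", "count"], ["A", "B", "C"]))
    print(sample_per_code_features(RandomTrial(seed=1), ["1d", "7d"], ["sum", "count"], ["A", "B", "C"]))
```

In a real objective, the `trial` passed in would be the Optuna trial, and the returned list would be what we hand to the data loader and XGBoost. Note that either function can return an empty list if every flag comes up `False`, so an actual implementation would need to handle or prevent that case.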
This would require changes to both the Optuna search space and the data loading side, so it would be an involved effort. The result, however, would be a system that identifies the most critical features as part of the search itself, potentially aiding interpretability, and that is much more flexible than our current systems.