Flexible data slicing logic

Problem

Currently the following logic is hardcoded in Manifold:

- When choosing auto segmentation, users can choose the comparison metric; when choosing manual segmentation, they cannot. The metric stays at whatever was set while in auto mode.
- When choosing manual segmentation, users can select only feature columns, not prediction columns.
- The comparison metric cannot be flexibly defined.

Because of this, the following use cases cannot be easily implemented:

Use cases

Real-world uses this improvement will support (from customer interviews):

1. Highlight the direction of errors
2. Identify very badly performing data points (outliers) for inspection

Solutions

Use case 1 can be implemented by setting the performance metric to a signed value that indicates over- or under-prediction (instead of the absolute prediction error), and allowing users to manually segment data on this metric column (i.e. set the segmentation threshold to 0).

Use case 2 can be implemented by allowing users to manually segment data on the performance metric column with a very high segmentation threshold, so that only a few data points fall in group 0.
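The two use cases above can be sketched as a single threshold split over a metric column. This is a minimal illustration, not Manifold's implementation: the `PredictionRow` shape, `signedError`, and `segmentByThreshold` are hypothetical names, and putting rows above the threshold in group 0 is an assumption.

```typescript
// Hypothetical row shape; Manifold's actual data format may differ.
interface PredictionRow {
  predicted: number;
  actual: number;
}

// Signed error: positive means over-prediction, negative means under-prediction.
function signedError(row: PredictionRow): number {
  return row.predicted - row.actual;
}

// Split rows into two groups by comparing a metric against a threshold.
// Assumption: group 0 holds rows whose metric is above the threshold.
function segmentByThreshold<T>(
  rows: T[],
  metric: (row: T) => number,
  threshold: number
): [T[], T[]] {
  const group0: T[] = [];
  const group1: T[] = [];
  for (const row of rows) {
    if (metric(row) > threshold) {
      group0.push(row);
    } else {
      group1.push(row);
    }
  }
  return [group0, group1];
}

const sampleRows: PredictionRow[] = [
  {predicted: 12, actual: 10},
  {predicted: 7, actual: 10},
  {predicted: 10.5, actual: 10},
  {predicted: 40, actual: 10},
];

// Use case 1: a threshold of 0 on the signed error separates over- from under-prediction.
const [over, under] = segmentByThreshold(sampleRows, signedError, 0);

// Use case 2: a high threshold on the absolute error leaves only a few rows in group 0.
const [outliers, typical] = segmentByThreshold(
  sampleRows,
  row => Math.abs(signedError(row)),
  20
);
```

Both use cases reduce to the same operation; only the metric and the threshold differ, which is why the metric needs to be selectable in manual mode.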

To enable these, we need to make the following state fields independent knobs (instead of hard-coding the value of one field based on the value of another), and then hook each of them up to UI controls:

- isManualSegmentation: whether to apply manual (filter-based) or automatic (k-means) data slicing
- baseCols: which columns to slice on (either by creating filters for these columns, or by feeding them into k-means clustering)
- nClusters: number of clusters to use (only applicable to automatic slicing)
- segmentFilters: filter logic that defines each data segment (only applicable to manual slicing)
- segmentGroups: which segments to group together when comparing groups against each other
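One way to picture these fields as independent knobs is a state object where each update touches exactly one field. The sketch below is an assumption-laden illustration: the field names come from the list above, but the field types, the `RangeFilter` shape, and the `setField` helper are hypothetical and are not taken from Manifold's code.

```typescript
// Hypothetical filter shape: keep rows whose value in `column` lies in [min, max].
interface RangeFilter {
  column: string;
  min: number;
  max: number;
}

// Field names follow the proposal; the types are assumptions.
interface SlicingState {
  isManualSegmentation: boolean;
  baseCols: string[];
  nClusters: number;
  segmentFilters: RangeFilter[][]; // one list of filters per segment
  segmentGroups: number[][]; // segment indices belonging to each comparison group
}

// Update a single field and leave every other field untouched,
// so no field's value is derived from another field's value.
function setField<K extends keyof SlicingState>(
  state: SlicingState,
  key: K,
  value: SlicingState[K]
): SlicingState {
  const next: SlicingState = {...state};
  next[key] = value;
  return next;
}

const initial: SlicingState = {
  isManualSegmentation: false,
  baseCols: ["feature_a"],
  nClusters: 4,
  segmentFilters: [],
  segmentGroups: [[0], [1]],
};

// Switching to manual mode does not reset the base columns or the cluster count.
const manual = setField(initial, "isManualSegmentation", true);

// Base columns can then be changed on their own, e.g. to a prediction column.
const onPrediction = setField(manual, "baseCols", ["prediction"]);
```

Each UI control would then map to one `setField` call, rather than one control implicitly rewriting several fields.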

Milestone

To validate the success of the change, we will evaluate how the 2 user tasks in the "Use cases" section can be achieved.

Appendix

A complete list of variables in the slicing logic

1. Ways to define “performance column”
   a. Delta between the prediction column and the actual column
   b. Preset loss function
      - Classification: Log Loss
      - Regression: Mean Square Error
   c. User-defined loss function (https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
   d. Use raw prediction
2. Column type to segment on
   a. Feature column
   b. Performance column
   c. Prediction column
   d. Create a new column (e.g. delta between 2 performance columns)
3. Data segmentation strategies
   a. Auto segmentation (through k-means)
   b. Manual segmentation (through defining filter values)
4. Number of columns to segment on
   a. Single column
   b. Multiple columns

Items in 1, 2, 3 and 4 are independent, e.g. you can have 1a + 2a + 3a + 4a, or 1a + 2b + 3a + 4b, depending on specific needs.
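The ways of defining a performance column listed above can be sketched as interchangeable per-row functions. This is an illustrative sketch only: the `PerformanceFn` type and the function names are hypothetical, the log loss shown is the binary form (predicted probability of class 1 against a 0/1 label), and the regression loss is the per-row squared error, whose mean over a segment gives the mean squared error.

```typescript
// Hypothetical signature: one performance value per data point.
type PerformanceFn = (predicted: number, actual: number) => number;

// Signed delta between the prediction column and the actual column.
const delta: PerformanceFn = (predicted, actual) => predicted - actual;

// Regression preset: per-row squared error.
const squaredError: PerformanceFn = (predicted, actual) =>
  (predicted - actual) * (predicted - actual);

// Classification preset: binary log loss for one data point.
// `predicted` is the probability of class 1, `actual` is 0 or 1.
// The probability is clipped to avoid taking log(0).
const logLoss: PerformanceFn = (predicted, actual) => {
  const eps = 1e-15;
  const p = Math.min(Math.max(predicted, eps), 1 - eps);
  return -(actual * Math.log(p) + (1 - actual) * Math.log(1 - p));
};

// Raw prediction used directly as the performance value.
const rawPrediction: PerformanceFn = predicted => predicted;

// A user-defined loss is any other function with the same signature.
const absoluteError: PerformanceFn = (predicted, actual) =>
  Math.abs(predicted - actual);
```

Because every option has the same signature, the choice of performance column stays independent of the column type, segmentation strategy, and number of columns chosen afterwards.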

Code structure

Data slicing is only one part of the application's logic. Conceptually, the application's functionality will be structured into the following components (**we are not actively working on this refactoring; restructuring will be done piecemeal to keep the priority on functionality**):

- Data generation: updating the performance metric, computing the performance score, computing the delta between 2 performance columns, etc. These actions change the data field.
- Data slicing: toggling auto/manual data slicing, choosing base columns to slice on, configuring segmentation filters, etc. These actions do not change the data field, but they change the data subsets.
- Visualization configuration: changing which feature column to color by, etc. These actions do not change the data slices, but they change what is displayed.
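The separation between the three components can be illustrated with a small state object in which each kind of action writes to exactly one part. Everything in this sketch is hypothetical (the `AppState` shape, the column names, and the three functions); it only shows which part of the state each component is allowed to change.

```typescript
// Hypothetical row and state shapes, for illustration only.
interface DataRow {
  feature: number;
  lossA: number;
  lossB: number;
  delta?: number;
}

interface AppState {
  data: DataRow[]; // changed only by data generation
  slices: number[][]; // row indices per segment; changed only by data slicing
  colorBy: keyof DataRow | null; // changed only by visualization configuration
}

// Data generation: computes the delta between 2 performance columns, so `data` changes.
function computeDelta(state: AppState): AppState {
  const data = state.data.map(row => ({...row, delta: row.lossA - row.lossB}));
  return {...state, data};
}

// Data slicing: `data` is untouched, only the subsets change.
function sliceByFeature(state: AppState, threshold: number): AppState {
  const low: number[] = [];
  const high: number[] = [];
  state.data.forEach((row, index) => {
    if (row.feature < threshold) {
      low.push(index);
    } else {
      high.push(index);
    }
  });
  return {...state, slices: [low, high]};
}

// Visualization configuration: neither `data` nor `slices` changes.
function setColorBy(state: AppState, column: keyof DataRow): AppState {
  return {...state, colorBy: column};
}

const s0: AppState = {
  data: [
    {feature: 1, lossA: 0.5, lossB: 0.25},
    {feature: 5, lossA: 0.25, lossB: 0.75},
  ],
  slices: [[0, 1]],
  colorBy: null,
};
const s1 = computeDelta(s0);
const s2 = sliceByFeature(s1, 3);
const s3 = setColorBy(s2, "feature");
```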
