uber / manifold

A model-agnostic visual debugging tool for machine learning
Apache License 2.0

Flexible data slicing logic #20

Closed Firenze11 closed 4 years ago

Firenze11 commented 5 years ago

Problem

Currently the following logic is hardcoded in Manifold:

Because of this, the following use cases cannot be easily implemented:

Use cases

Real-world uses that this improvement will support include (from customer interviews):

  1. Highlight direction of errors
  2. Identify really badly-performing data points (outliers) for inspection

Solutions

Use case 1 can be implemented by setting the performance metric to indicate over- or under-prediction (instead of absolute prediction error) and allowing users to manually segment data on this metric column (i.e. setting the segmentation threshold to 0).

Use case 2 can be implemented by allowing users to manually segment data on this metric column (i.e. setting the segmentation threshold high enough that only a few data points fall in group 0).
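
Both use cases reduce to a manual split of one metric column at a user-chosen threshold. The following is a minimal sketch in plain Python, not Manifold code: `segment_by_threshold` is a hypothetical helper, the sample numbers are made up, and it assumes group 0 holds the rows above the threshold.

```python
def segment_by_threshold(metric, threshold):
    """Split row indices into two groups around a manual threshold.

    Assumption for this sketch: group 0 holds rows whose metric is
    strictly above the threshold, group 1 holds all other rows.
    """
    group_0 = [i for i, m in enumerate(metric) if m > threshold]
    group_1 = [i for i, m in enumerate(metric) if m <= threshold]
    return group_0, group_1


# Made-up regression outputs for illustration only.
predictions = [10.0, 8.5, 12.0, 7.0, 30.0]
actuals = [9.0, 9.5, 12.5, 6.0, 10.0]

# Use case 1: a signed metric (prediction minus actual) split at 0
# separates over-predictions from the remaining rows.
signed = [p - a for p, a in zip(predictions, actuals)]
over, not_over = segment_by_threshold(signed, 0.0)

# Use case 2: an absolute-error metric split at a high threshold
# isolates the few worst rows for inspection.
absolute = [abs(e) for e in signed]
outliers, rest = segment_by_threshold(absolute, 5.0)
```

The same helper serves both tasks; only the metric definition and the threshold change, which is the independence the proposal asks for.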

To enable these, we need to make the following fields in state independent knobs (instead of hard-coding the value of one field based on the value of another), and then hook each of them to UI controls:

Milestone

To validate the success of the change, we will evaluate how the 2 user tasks in the "Use cases" section can be achieved.

Appendix

### A complete list of variables in the slicing logic

  1. Ways to define “performance column”
     a. Delta between the prediction column and the actual column
     b. Preset loss function
        - Classification: Log Loss
        - Regression: Mean Square Error
     c. User-defined loss function (https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)
     d. Use raw prediction
  2. Column type to segment on
     a. Feature column
     b. Performance column
     c. Prediction column
     d. Create a new column (e.g. delta between 2 performance columns)
  3. Data segmentation strategies
     a. Auto segmentation (through k-means)
     b. Manual segmentation (through defining filter values)
  4. Number of columns to segment on
     a. Single column
     b. Multiple columns

Items in 1, 2, 3, 4 are independent, e.g. you can have 1a + 2a + 3a + 4a, or 1a + 2b + 3a + 4b, based on specific needs.
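
One way to picture the four independent variables is as four separate fields of a configuration object. This is a sketch under assumptions: the class and member names (`SlicingConfig`, `PerformanceMetric`, and so on) are invented for illustration and are not Manifold's actual state fields.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class PerformanceMetric(Enum):      # variable 1 (hypothetical names)
    DELTA = "delta between prediction and actual"
    PRESET_LOSS = "preset loss function"
    USER_DEFINED_LOSS = "user-defined loss function"
    RAW_PREDICTION = "raw prediction"


class SegmentColumnType(Enum):      # variable 2
    FEATURE = "feature column"
    PERFORMANCE = "performance column"
    PREDICTION = "prediction column"
    DERIVED = "newly created column"


class SegmentationStrategy(Enum):   # variable 3
    AUTO = "k-means"
    MANUAL = "filter values"


class ColumnCount(Enum):            # variable 4
    SINGLE = "single column"
    MULTIPLE = "multiple columns"


@dataclass(frozen=True)
class SlicingConfig:
    """Each field is set on its own; none is derived from another."""
    metric: PerformanceMetric
    column_type: SegmentColumnType
    strategy: SegmentationStrategy
    column_count: ColumnCount


# Because the variables are independent, every combination is valid.
all_configs = [
    SlicingConfig(*combo)
    for combo in product(
        PerformanceMetric, SegmentColumnType, SegmentationStrategy, ColumnCount
    )
]

# The "1a + 2b + 3a + 4b" combination mentioned in the text.
example = SlicingConfig(
    PerformanceMetric.DELTA,
    SegmentColumnType.PERFORMANCE,
    SegmentationStrategy.AUTO,
    ColumnCount.MULTIPLE,
)
```

With 4, 4, 2, and 2 options respectively, the sketch enumerates 64 combinations, which illustrates why hard-coding one field from another restricts the tool.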

### Code structure

Data slicing is only part of the logic in the application. Conceptually, the functionalities of the application will be structured into the following components (**we do not actively work on the refactoring; restructuring will be done piecemeal to prioritize functionalities**):

- Data generation: updating the performance metric, computing the performance score, computing the delta between 2 performance columns, etc. These actions cause changes in the data field.
- Data slicing: toggling auto/manual data slicing, choosing base columns to slice on, configuring segmentation filters, etc. These actions don't change the data field but do change the data subsets.
- Visualization configuration: changing which feature column to color by, etc. These actions don't change the data slices but do cause display changes.
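
The three-component separation can be sketched as three pure functions, each touching only its own layer. This is an illustrative assumption about the structure, not the application's real code: the function names, the `error` column, and the sample rows are all hypothetical.

```python
def generate_data(rows):
    """Data generation: return new rows with a performance column added."""
    return [{**r, "error": abs(r["prediction"] - r["actual"])} for r in rows]


def slice_data(rows, column, threshold):
    """Data slicing: return index subsets; the rows themselves are untouched."""
    above = [i for i, r in enumerate(rows) if r[column] > threshold]
    rest = [i for i, r in enumerate(rows) if r[column] <= threshold]
    return [above, rest]


def configure_view(slices, color_by):
    """Visualization configuration: attach display settings to existing slices."""
    return {"slices": slices, "color_by": color_by}


# Made-up rows for illustration only.
rows = [
    {"feature": 1.0, "prediction": 10.0, "actual": 9.0},
    {"feature": 2.0, "prediction": 5.0, "actual": 9.0},
    {"feature": 3.0, "prediction": 7.0, "actual": 7.5},
]

data = generate_data(rows)                  # changes the data field
slices = slice_data(data, "error", 2.0)     # changes subsets, not data
view = configure_view(slices, "feature")    # changes display only
```

Keeping each step free of side effects on the earlier layers means a change to the color-by column never forces re-slicing, and a new slicing threshold never recomputes the metric.
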
Firenze11 commented 5 years ago

34