sibyl-dev / sibylapp

Sibyl application
MIT License

Handle multiple data transformations without having to recompute every time predict is called #62

Closed: zyteka closed 3 years ago

zyteka commented 4 years ago

Description

There are multiple forms a dataset can take:

  1. Human readable, which will likely have string values for many features (such as True/False and category names). This is the version that should be displayed to the user in Sibylapp. However, for explanation accuracy, it should also include engineered features.
  2. Converted to numeric, which has booleans replaced with 1/0 and categories replaced with one-hot encodings (OHE) or integers. This version may be the most flexible for some explanation types, namely counterfactuals. However, these explanations should also be recoded back to the human-readable form.
  3. Transformed for the model, where the data has undergone any number of transformations to reach a state readable by the model. Currently, this is most naturally done whenever model.predict() is called. However, this leads to many calls to transform(), which will be very slow if transform() is a non-trivial function. Additionally, it requires developers to write Python functions that convert arbitrary rows of human-readable data to transformed data. If the data was preprocessed in another language or program, they should ideally be able to use that data directly.
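To make the three forms concrete, here is a minimal sketch of one row moving through them. The feature names, the category list, and the scaling step are all hypothetical and are not taken from Sibylapp; the scaling simply stands in for whatever model-specific transform a real pipeline might apply.

```python
# Hypothetical category list and feature names, for illustration only.
CATEGORIES = ["apartment", "house", "condo"]


def to_numeric(row):
    """Form 1 -> form 2: booleans become 1/0, the category becomes one-hot columns."""
    numeric = {
        "has_garage": 1 if row["has_garage"] == "True" else 0,
        "area": row["area"],
    }
    for cat in CATEGORIES:
        numeric[f"type_{cat}"] = 1 if row["type"] == cat else 0
    return numeric


def to_model_ready(numeric_row, area_mean=100.0, area_std=50.0):
    """Form 2 -> form 3: a stand-in model transform (here, scaling one column)."""
    ready = dict(numeric_row)
    ready["area"] = (numeric_row["area"] - area_mean) / area_std
    return ready


def to_readable(numeric_row):
    """Form 2 -> form 1: recode numeric values back to human-readable strings."""
    cat = next(c for c in CATEGORIES if numeric_row[f"type_{c}"] == 1)
    return {
        "has_garage": str(bool(numeric_row["has_garage"])),
        "type": cat,
        "area": numeric_row["area"],
    }


readable = {"has_garage": "True", "type": "house", "area": 150.0}
numeric = to_numeric(readable)
model_ready = to_model_ready(numeric)
round_trip = to_readable(numeric)
```

The round trip from form 2 back to form 1 is the step that lets an explanation computed on numeric data, such as a counterfactual, be shown to the user in readable terms.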

Solution

To fix all these problems, I propose adding an optional field to Entity documents in the database that stores the features parsed for each model, in the form of a dict: {model_id: {feature_name: feature_value, ...}, ...}. Functions that call predict() will use these feature values if they are available; otherwise, they will default to the standard features. Additionally, the features field will be updated for existing datasets to use human-readable strings.
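The lookup-with-fallback behavior proposed above could be sketched roughly as follows. This is an assumption-laden illustration, not Sibylapp's code: a plain dict stands in for an Entity document, and the field name "model_features", the helper get_model_features(), and slow_transform() are hypothetical.

```python
transform_calls = []  # records each fallback call, to show when transform() runs


def slow_transform(features):
    """Stand-in for an expensive transform() on human-readable features."""
    transform_calls.append(features)
    return {"has_garage": 1 if features["has_garage"] == "True" else 0}


def get_model_features(entity, model_id, transform=slow_transform):
    """Return stored per-model features if present, else transform the standard ones.

    `entity` is a dict standing in for an Entity document; the optional
    "model_features" field (hypothetical name) maps model_id -> feature dict.
    """
    stored = entity.get("model_features", {}).get(model_id)
    if stored is not None:
        return stored
    return transform(entity["features"])


entity = {
    "features": {"has_garage": "True"},                 # human-readable form
    "model_features": {"model_a": {"has_garage": 1}},   # precomputed for model_a only
}

from_store = get_model_features(entity, "model_a")     # uses stored values
from_fallback = get_model_features(entity, "model_b")  # falls back to transform()
```

With this shape, transform() only runs for models that have no stored features, and data preprocessed outside Python can be loaded straight into the per-model field.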

zyteka commented 3 years ago

This change will be handled by the updated Pyreal library.