psaml / psaml

PySpark Sensitivity Analysis of ML models
Apache License 2.0

Augment test_data generator to handle categorical variables #11

Open Aitocir opened 8 years ago

Aitocir commented 8 years ago

_This relies on Work Item #10._

The test_data generator accepts a data_info DataFrame and generates fake data in a DataFrame for making sensitivity analysis predictions. It currently works only for continuous input variables; we would like it to generate test data for categorical variables as well.

This enhancement is currently blocked on a discussion about how we want the data generated. For example, do we ignore the ctrl_sensitivity and just iterate over all possible combinations of the variables (highly inefficient at scale)? Do we determine a ranking so we can call one value "min" and another "max" (inaccurate)? Do we hold them to the most common values and just call it good (potential loss of useful information)?
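To put a number on the "all possible combinations" option, here is a minimal sketch in plain Python. The column names and levels are hypothetical, chosen only for illustration; they are not taken from the project's data. The row count of a full cross product is the product of each variable's number of distinct values, so it grows multiplicatively with every categorical column added.

```python
from itertools import product
from math import prod


def count_combinations(levels):
    """Number of rows a full cross product of categorical levels would produce."""
    return prod(len(values) for values in levels.values())


# Hypothetical columns and levels, for illustration only.
levels = {
    "Species": ["setosa", "versicolor", "virginica"],
    "Region": ["north", "south", "east", "west"],
    "Grade": ["a", "b", "c", "d", "e"],
}

total = count_combinations(levels)       # 3 * 4 * 5
rows = list(product(*levels.values()))   # every combination, one tuple per row
```

Each of these rows would still have to be crossed with the continuous test values, which is why this option gets expensive quickly.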

Aitocir commented 8 years ago

So we decided to use the few most frequent categorical values, then brute-force the combinations. I'll be adding this later today for four or fewer categorical values. Snippet below for reference:

`data_iris.freqItems(['Species'], 0.33).first()[0][2]`

This yields the third frequent item from the Species column (I'm not sure whether the order means anything, but that doesn't matter for our purposes).
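As a rough illustration of the "most frequent values, then brute force" idea, here is a sketch in plain Python rather than Spark. It is not the code added to the repository. The helper names `top_values` and `categorical_grid`, the sample rows, and the default `k=4` are all assumptions made for this example. It also uses exact counts from `collections.Counter`, whereas `freqItems` is an approximate method, so results on real data may differ.

```python
from collections import Counter
from itertools import product


def top_values(rows, column, k):
    """Return up to k most frequent values of `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    return [value for value, _ in counts.most_common(k)]


def categorical_grid(rows, columns, k=4):
    """Brute-force every combination of the top-k values of each column.

    k=4 is an assumed default for this sketch, not a project setting.
    """
    choices = [top_values(rows, column, k) for column in columns]
    return [dict(zip(columns, combo)) for combo in product(*choices)]


# Hypothetical sample rows, for illustration only.
rows = [
    {"Species": "setosa", "Site": "A"},
    {"Species": "setosa", "Site": "B"},
    {"Species": "virginica", "Site": "A"},
    {"Species": "versicolor", "Site": "A"},
    {"Species": "virginica", "Site": "A"},
    {"Species": "setosa", "Site": "A"},
]

# Keep the two most frequent values per column, then enumerate all pairs.
grid = categorical_grid(rows, ["Species", "Site"], k=2)
```

Capping each column at `k` values bounds the grid at `k` to the power of the number of categorical columns, which keeps the brute-force step manageable at the cost of ignoring rare values.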

Aitocir commented 8 years ago

Some untested code has been added. I'm thinking tomorrow's get-together will be a good time to run through it and see how well it works.