project-palooza / unsupervised


preprocessing optimization (researcher degrees of freedom) #4

Open a-arad opened 3 months ago

a-arad commented 3 months ago

this ticket cannot be assigned until issues #1 and #2 are complete

our preprocessing pipeline currently is:

we cannot be sure this is the best way to preprocess the data.

so set up an experiment using the following guidelines:

  1. for each preprocessing step, make a list of options

e.g.

missing data -> [impute, drop]

transformation -> [none, x -> log(x), x -> box-cox(x), x -> sqrt(x), .... ]

scaling -> [none, robust, standard (z-score), minmax]

outliers -> [none, winsorization, drop, something else]

then for each combination

e.g.

[impute, x -> log(x), minmax, drop]

  1. fit a k-means model with k=2
  2. record the model's inertia score

then determine which combination of preprocessing steps results in the lowest inertia
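
a rough sketch of how the experiment above could look, assuming scikit-learn and NumPy. everything named here is hypothetical and not taken from the repo: the helper functions, the `OPTIONS` dict, median imputation, the 1st/99th percentile cutoffs for outliers, and the toy data. box-cox and "something else" are left out to keep it short, and `log` assumes strictly positive features.

```python
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical option lists, one per preprocessing step (order matters below).
OPTIONS = {
    "missing": ["impute", "drop"],
    "transformation": ["none", "log", "sqrt"],
    "scaling": ["none", "robust", "standard", "minmax"],
    "outliers": ["none", "winsorize", "drop"],
}


def handle_missing(X, how):
    if how == "drop":
        return X[~np.isnan(X).any(axis=1)]  # keep only complete rows
    medians = np.nanmedian(X, axis=0)  # assumption: impute with column medians
    return np.where(np.isnan(X), medians, X)


def transform(X, how):
    if how == "log":
        return np.log(X)  # assumes strictly positive values
    if how == "sqrt":
        return np.sqrt(X)  # assumes non-negative values
    return X


def scale(X, how):
    scalers = {"robust": RobustScaler, "standard": StandardScaler, "minmax": MinMaxScaler}
    return X if how == "none" else scalers[how]().fit_transform(X)


def handle_outliers(X, how):
    # assumption: treat values outside the 1st/99th percentiles as outliers
    lo, hi = np.percentile(X, [1, 99], axis=0)
    if how == "winsorize":
        return np.clip(X, lo, hi)
    if how == "drop":
        return X[((X >= lo) & (X <= hi)).all(axis=1)]
    return X


def run_experiment(X, options):
    """Fit k-means (k=2) once per combination; return results sorted by inertia."""
    results = []
    for missing, trans, scaling, outliers in product(*options.values()):
        Xp = handle_missing(X, missing)
        Xp = transform(Xp, trans)
        Xp = scale(Xp, scaling)
        Xp = handle_outliers(Xp, outliers)
        model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xp)
        results.append(((missing, trans, scaling, outliers), model.inertia_))
    return sorted(results, key=lambda r: r[1])


# Toy data standing in for the real dataset: positive values with a few NaNs.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))
X[rng.integers(0, 200, size=10), rng.integers(0, 3, size=10)] = np.nan

results = run_experiment(X, OPTIONS)
best_combo, best_inertia = results[0]
print(best_combo, best_inertia)
```

`results[0]` holds the combination with the lowest inertia. one thing to keep in mind when reading the ranking: inertia is a sum of squared distances in the transformed feature space, so its magnitude depends on the scaling and transformation applied, and on how many rows survive the drop steps.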

Sofia0204 commented 3 months ago

Hi, I think I can do this one after issue #1 is completed

a-arad commented 2 months ago

yes - i followed up on issue 1 and will let you know when this issue becomes clear to work on