Open f-hafner opened 4 months ago
Also: we should have an item in the summary statistics with the number of unique values in the column. This way, we can later decide on a threshold: if a numerical column has only a few distinct values, we draw the fake data from a categorical distribution; otherwise, we draw from a normal distribution. (Currently, we do this with a heuristic based on whether the 10th and 90th percentiles are no more than X apart; the default for X is 10.)
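A rough sketch of what this could look like, not the repo's actual code: the function names `summarize` and `choose_distribution`, the `n_unique` key, and the threshold default of 10 are all assumptions made up for illustration.

```python
import statistics

# Hypothetical default; the real threshold would still need to be decided.
UNIQUE_THRESHOLD = 10


def summarize(values):
    """Summary statistics for one numerical column, including the unique count."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "n_unique": len(set(values)),  # the proposed new item
    }


def choose_distribution(summary, threshold=UNIQUE_THRESHOLD):
    """Pick the sampling distribution from the number of distinct values."""
    if summary["n_unique"] <= threshold:
        return "categorical"
    return "normal"
```

With this, the decision only needs the stored `n_unique` value and no longer depends on the percentile spread.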
here https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/f6d1446d031c9ac2c40e5b1f2a60461d7fc1edc0/src/others/synthetic_data_generation/spreadsheets.sh#L28
we should add