odissei-lifecourse / life-sequencing-dutch

MIT License

summary stats for raw input files for llm are missing #62

Open f-hafner opened 4 months ago

f-hafner commented 4 months ago

At this line in spreadsheets.sh:
https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/f6d1446d031c9ac2c40e5b1f2a60461d7fc1edc0/src/others/synthetic_data_generation/spreadsheets.sh#L28

we should add:

python gen_csv_from_jsons.py "$DATAPATH/synthetic/data/raw_data/" "$DATAPATH/synthetic/data-final/raw_data/"

f-hafner commented 1 month ago

Also: the summary statistics should include an item with the number of unique values in each column. That way we can later decide on a threshold: if a numerical column has only a few distinct values, we should draw the fake data from a categorical distribution; if not, we draw from a normal distribution. (Currently, we approximate this with a heuristic that checks whether the 10th and 90th percentiles are no more than X apart; X defaults to 10.)
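
A rough sketch of how this could look, to make the proposal concrete. It is not the code currently in the repo: the function names `summarize_column` and `draw_fake_column`, the `nunique` and `frequencies` keys, and the `max_categories` threshold of 20 are all made up for illustration, and storing per-value frequencies in the summary is an assumption that would need to be checked against what is allowed to leave the secure environment.

```python
import random
import statistics
from collections import Counter


def summarize_column(values, max_categories=20):
    """Summarize one numeric column, including its number of distinct values.

    `max_categories` is a hypothetical threshold, not a value from the repo.
    """
    counts = Counter(values)
    summary = {
        "n": len(values),
        "nunique": len(counts),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }
    # Assumption: relative frequencies are only kept for columns with
    # few distinct values, so that they can be resampled as categories.
    if summary["nunique"] <= max_categories:
        summary["frequencies"] = {v: c / len(values) for v, c in counts.items()}
    return summary


def draw_fake_column(summary, size, rng):
    """Draw fake values: categorical if frequencies are stored, else normal."""
    if "frequencies" in summary:
        categories = list(summary["frequencies"].keys())
        weights = list(summary["frequencies"].values())
        return rng.choices(categories, weights=weights, k=size)
    return [rng.gauss(summary["mean"], summary["std"]) for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    few_values = summarize_column([0, 1, 1, 2] * 25)
    many_values = summarize_column(list(range(100)))
    print(few_values["nunique"], draw_fake_column(few_values, 5, rng))
    print(many_values["nunique"], draw_fake_column(many_values, 5, rng))
```

With `nunique` in the summary, the choice between the two distributions depends directly on the count of distinct values rather than on the percentile spread, and the threshold can be tuned later without recomputing the statistics.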