sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
2.07k stars, 206 forks

Using percentages instead of counts to compare distribution of two tables #834

Open borisRa opened 2 years ago

borisRa commented 2 years ago

Hi,

How can I compare the train and test distributions? I am using this code: plot_diff([train_df[train_df.columns[~train_df.columns.isin(['Survived'])]], test_df], config={"diff.label": ["train_df", "test_df"]})

This gives me raw counts, but I would like to compare percentages instead, similar to this plot of the Age distribution: [image]

Thanks! Boris

jinglinpeng commented 2 years ago

Hi @borisRa, thanks for raising this issue. Would diff.density=True work for you? (related: https://github.com/sfu-db/dataprep/pull/698)

borisRa commented 2 years ago

Nope. The result should look like the plot above, so that I can compare distributions rather than counts.
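
For readers who land here with the same question, the per-bin percentage comparison described above can be sketched outside dataprep with plain NumPy. This is only an illustration, not dataprep's implementation: the helper name `percent_histograms` and the sample ages are made up. Note that NumPy's own `density=True` normalizes by area (the histogram integrates to 1), which is not the same as the share of rows per bin computed here.

```python
import numpy as np

def percent_histograms(train_values, test_values, bins=10):
    """Hypothetical helper: per-bin percentages for two samples on shared bins."""
    train = np.asarray(train_values, dtype=float)
    test = np.asarray(test_values, dtype=float)
    # Drop missing values so they do not distort the bin edges.
    train = train[~np.isnan(train)]
    test = test[~np.isnan(test)]
    # Use one set of bin edges for both samples so the bars line up.
    edges = np.histogram_bin_edges(np.concatenate([train, test]), bins=bins)
    train_counts, _ = np.histogram(train, bins=edges)
    test_counts, _ = np.histogram(test, bins=edges)
    # Convert counts to the percentage of rows falling in each bin.
    train_pct = 100.0 * train_counts / train_counts.sum()
    test_pct = 100.0 * test_counts / test_counts.sum()
    return edges, train_pct, test_pct

# Made-up ages standing in for train_df["Age"] and test_df["Age"].
train_age = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]
test_age = [34, 47, 62, 27, 22]
edges, train_pct, test_pct = percent_histograms(train_age, test_age, bins=4)
```

Because both samples are divided by their own totals, the two sets of bars each sum to 100 and stay comparable even when the train and test tables have very different row counts. The returned arrays can be passed to any bar-plotting function.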