smadha / MlTrio

CSCI-567 course project
Apache License 2.0

Handle skewness #5

Open smadha opened 7 years ago

smadha commented 7 years ago

The classes are highly imbalanced, roughly 8:1 (class 0 to class 1):

$ cut -d $'\t' -f3 invited_info_train.txt | grep -c 0
218428
$ cut -d $'\t' -f3 invited_info_train.txt | grep -c 1
27324

A few ideas to handle the skewness:

  1. Undersample class 0: take 27324 examples from class 0 so the data set is balanced.
  2. Create the dictionary of words/characters only from class 1, so that words not present in class 1 will never be a feature.
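
Both ideas could be sketched roughly as follows. This is a minimal illustration, not the project's code. It assumes each row is a tuple whose third field is the label as a string (matching the `cut -f3` above) and that each example has already been split into a list of tokens. The function names `undersample_majority`, `build_positive_vocabulary`, and `to_features` are hypothetical.

```python
import random
from collections import Counter


def undersample_majority(rows, label_index=2, seed=0):
    """Idea 1: randomly downsample every class to the size of the smallest one.

    Assumes the label sits at rows[i][label_index] (third column by default).
    """
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)
    if not by_label:
        return []

    minority_size = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for group in by_label.values():
        # Sampling without replacement; the minority class is kept whole.
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)
    return balanced


def build_positive_vocabulary(documents, labels, positive_label="1"):
    """Idea 2: collect only tokens that occur in a positive-class example."""
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        if label == positive_label:
            vocabulary.update(tokens)
    return vocabulary


def to_features(tokens, vocabulary):
    """Count tokens that are in the vocabulary; all other tokens are dropped."""
    return Counter(token for token in tokens if token in vocabulary)
```

Rows in this shape can be read from the tab-separated file with `csv.reader(f, delimiter="\t")`. Note that undersampling discards most of class 0, so it may be worth comparing against training on the full set with class weights before settling on it.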