udacity / mlnd_issues_tracker


Get_dummies trap: Issue with collinearity #17

Open traveling-desi opened 7 years ago

traveling-desi commented 7 years ago

Hello!

I know the Finding Donors project uses get_dummies(). I don't remember whether any other projects do.

Please refer to this: https://github.com/pandas-dev/pandas/issues/12042

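As you can see, because get_dummies performs one-hot encoding, running it on any feature produces columns where the last one can be fully predicted from the rest: exactly one column is 1 in each row, so the last column equals 1 minus the sum of the others (in logic terms, a NOR of the other columns). That is perfect collinearity. So the correct way to use get_dummies is to pass drop_first=True.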
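To make the redundancy concrete, here is a minimal sketch. The column name `workclass` and its values are made up for illustration and are not taken from the project notebook; `dtype=int` is passed only so the arithmetic reads the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy feature; names and values are illustrative only.
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Self-emp", "Private"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df["workclass"], dtype=int)

# Each row contains exactly one 1, so the last column can be rebuilt
# as 1 minus the sum of all the other columns.
last = full.columns[-1]
reconstructed = 1 - full.drop(columns=last).sum(axis=1)
is_redundant = bool((reconstructed == full[last]).all())

# drop_first=True removes the first category's column, which breaks
# the exact linear dependence among the remaining columns.
reduced = pd.get_dummies(df["workclass"], drop_first=True, dtype=int)

print(is_redundant)
print(list(full.columns), list(reduced.columns))
```

With drop_first=True the dropped category becomes the implicit baseline: a row with all zeros in the remaining columns belongs to it.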

It is, of course, left to the user to write the get_dummies call, but the notebook does not mention this issue. If you agree that this is a valid issue and the notebook needs to be changed, please update the instructions so that students add the drop_first argument.

If you do end up making this change, please acknowledge Nupur (https://discussions.udacity.com/t/how-to-avoid-collinearity-problem-with-pd-getdummies/284692), who pointed this out.