mrdbourke / tensorflow-deep-learning

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.
https://dbourke.link/ZTMTFcourse
MIT License

Notebook 01: pd.get_dummies() resulting in True/False values instead of 1/0 - Causing issues with creating model #559

Open ralversity opened 1 year ago

ralversity commented 1 year ago

Not sure if I may have just done something wrong here, or if something has changed. But I noticed that when going through this I was having trouble creating the model. I discovered that the reason is that when I did this part:

image

It resulted in this:

image

I wound up changing the function to this and it fixed it for me, although not sure if this was the right thing to do or not:

image

cwestergren commented 1 year ago

What error do you get when creating the model? I believe Python implements bool as a subclass of int, so if you, for example, pass your insurance_one_hot through a Normalization layer, the output will be [0, 1].

This example shows the integer subclass

image

And then applying normalization will just use the bool and give you a [0,1] float32 back.

image
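Since the screenshots are not reproduced here, the subclass relationship can be checked with a minimal sketch using only built-in Python (the variable names are illustrative, not taken from the notebook):

```python
# bool is a subclass of int in Python, so True/False behave as 1/0 in arithmetic.
is_subclass = issubclass(bool, int)
total = True + True  # booleans add like integers
as_ints = [int(flag) for flag in (True, False, True)]

print(is_subclass)  # True
print(total)        # 2
print(as_ints)      # [1, 0, 1]
```

This only shows the plain-Python behaviour; how a particular layer or scaler handles a bool column still depends on that library.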

mayankbungla commented 8 months ago

Facing same issue

mrdbourke commented 8 months ago

Hi @ralversity, @cwestergren and @uKnowKlaus,

As of pandas 2.0, pd.get_dummies() returns bool dtypes by default (previously the default was uint8, an integer type).

You can get the behaviour of the first screenshot by setting pd.get_dummies(dtype=int).

For example:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df_one_hot = pd.get_dummies(df, dtype=bool) # bool is default
df_one_hot

Output:

   C    A_a    A_b    B_a    B_b    B_c
0  1   True  False  False   True  False
1  2  False   True   True  False  False
2  3   True  False  False  False   True

Change to dtype=int:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 
                   'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df_one_hot = pd.get_dummies(df, dtype=int)
df_one_hot

Output:

   C  A_a  A_b  B_a  B_b  B_c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1

See the docs here: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
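If the goal is simply to hand numeric features to a model, one option is to cast the whole one-hot frame to a float dtype before converting it to an array. The following is a minimal sketch, assuming the same toy DataFrame as above and assuming float32 is the dtype the model expects; the name `features` is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})

# One-hot encode (bool columns by default on pandas 2.0+),
# then cast every column to float32 before building the array.
df_one_hot = pd.get_dummies(df)
features = df_one_hot.astype("float32").to_numpy()

print(features.dtype)  # float32
print(features.shape)  # (3, 6)
```

Casting after encoding works the same whether the dummy columns came back as bool or as integers.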

mayankbungla commented 8 months ago

Hey @mrdbourke, thanks for your reply. I already tried changing dtype to int and to float, but it was still returning bool values. I also tried restarting the kernel, with no effect whatsoever.
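When dtype=int appears to have no effect, it may help to confirm which pandas version is running and to inspect the dtypes of the returned DataFrame directly. This is a hedged diagnostic sketch, not a confirmed explanation of the problem above; the toy data and the names `encoded` and `bool_columns` are assumptions for illustration:

```python
import pandas as pd

print(pd.__version__)  # the default dummy dtype is bool from pandas 2.0 onward

df = pd.DataFrame({"A": ["a", "b", "a"], "C": [1, 2, 3]})

# get_dummies returns a new DataFrame and leaves `df` unchanged,
# so the result has to be assigned and then inspected.
encoded = pd.get_dummies(df, dtype=int)
print(encoded.dtypes)

# List any columns that are still bool after encoding.
bool_columns = encoded.select_dtypes(include="bool").columns.tolist()
print(bool_columns)  # []
```

If `bool_columns` is not empty in your own notebook, it is worth checking whether an earlier, un-encoded variable is being displayed instead of the new result.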

cwestergren commented 8 months ago

Do you get an error when applying normalisation though?

It's still a subclass of int, as documented at https://docs.python.org/3/c-api/bool.html

See my previous reply.

mayankbungla commented 8 months ago

@cwestergren I did use normalization as well, but it didn't work. I don't know what the issue with get_dummies is, so I went with LabelEncoding.

cwestergren commented 8 months ago

Understood. If you want to share your code here please do, but label encoding would work too.
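For readers unfamiliar with label encoding, here is a minimal sketch of the idea using pandas category codes as a stand-in (it is not the code used above, which was never shared, and the column name `region` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "southeast"]})

# Each unique label gets an integer code, assigned in sorted order of the labels.
df["region_code"] = df["region"].astype("category").cat.codes

print(df["region_code"].tolist())  # [2, 1, 0, 1]
```

Note that integer codes imply an ordering between categories, which one-hot encoding avoids, so the two approaches are not always interchangeable.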

mayankbungla commented 8 months ago

get_dummy

cwestergren commented 8 months ago

Thanks. I'm after the point of error. It will still be a bool type, but internally it's integers.

Can you share the error you get?

mayankbungla commented 8 months ago

Sorry, I didn't save the errors. I moved on with LabelEncoding so..

cwestergren commented 8 months ago

All good, happy coding :)

samuelperezh commented 8 months ago

Hey @uKnowKlaus I had the same issue but then I tried with 'int64' instead of 'int' and it worked!

joaocastro95 commented 6 months ago

Thx everyone, I had this issue too

ehvs commented 4 months ago

@samuelperezh Hi, would you mind sharing the code you used with 'int64' ?

PatilHarshita09 commented 3 months ago

> Hey @uKnowKlaus I had the same issue but then I tried with 'int64' instead of 'int' and it worked!

Using np.int64 and then "Run all cells" worked for me.
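The exact code behind the 'int64' suggestion was not posted, so the following is only a sketch of what it might look like, reusing the toy DataFrame from earlier in the thread. The names `from_string` and `from_numpy` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})

# The string "int64" and np.int64 refer to the same NumPy dtype,
# so both calls should produce identical results.
from_string = pd.get_dummies(df, dtype="int64")
from_numpy = pd.get_dummies(df, dtype=np.int64)

print(from_string.dtypes)
print(from_string.equals(from_numpy))  # True
```

After changing the dtype argument, re-running the downstream cells (for example with "Run all") makes sure later variables are rebuilt from the new encoding.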

shereenwalid commented 1 month ago

I had the same issue, even after adding dtype=int. However, after adding df = df.astype(int) it worked perfectly well:

df = pd.get_dummies(df, sparse=False, dtype=int)
df = df.astype(int)
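One caveat with calling astype(int) on the whole DataFrame is that it would also truncate any float columns. A more targeted variant, sketched here under the assumption that only the bool columns need converting (the toy data and names are illustrative), casts just those columns:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "C": [1.5, 2.5, 3.5]})
encoded = pd.get_dummies(df)  # bool dummy columns on pandas 2.0+

# Cast only the bool columns so the float column "C" keeps its values.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

print(encoded["C"].tolist())    # [1.5, 2.5, 3.5]
print(encoded["A_a"].tolist())  # [1, 0, 1]
```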

Aseem2004 commented 2 weeks ago

Just use the built-in dtype parameter of pd.get_dummies(), like: df = pd.get_dummies(df, columns=['X', 'Y', 'Z'], dtype='int')

It works perfectly fine.