sds-capstone / 2022-09-proj7-women-at-table

Women at the Table

Baseline model with all variables and Coefficient plot #32

Closed: MargaretBassney closed this 2 years ago

rporta23 commented 2 years ago

Doc with code to review: Rose.ipynb

rporta23 commented 2 years ago

@MargaretBassney will you review this when you get a chance? I think you may have more intuition than me about whether anything is wrong, since you have done this before.

MargaretBassney commented 2 years ago

Hi @rporta23, I just took a quick look at it for now and wow, it looks great!! I mean, you got the baseline model and the coefficient plot, and figured out a lot of the data wrangling bits. I'll look it over in more depth later; I just wanted to let you know that I'm still working through it.

I agree with skipping the one-hot encoding because our categorical variables are already coded by number, although I must admit that I'd never heard of one-hot encoding before this class. I also agree with replacing the NAs with column means. I feel like there are a lot of trade-offs when dealing with NAs: removing them loses some data, but replacing them with the means kind of adds artificial data. I think that either approach is fine and kind of a personal preference. Maybe I'm totally wrong about that, but it's just my opinion lol.
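To make the NA trade-off concrete, here is a minimal sketch of the two options side by side. The column names and values are made up for illustration and are not from our dataset; it only assumes pandas and numpy.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (NOT the project's dataset).
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# Option 1: drop every row that has at least one missing value.
# Here that removes two of the four rows.
dropped = df.dropna()

# Option 2: replace each missing value with its column mean.
# All rows are kept, but the filled-in values are artificial.
imputed = df.fillna(df.mean())

print("rows: original", len(df), "dropped", len(dropped), "imputed", len(imputed))

# Mean imputation keeps the column mean but shrinks the spread,
# because the filled values sit exactly at the center.
print("age variance before:", df["age"].var())
print("age variance after: ", imputed["age"].var())
```

So dropping costs rows, while mean-filling keeps them but makes each column look less variable than it really is.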

You mentioned how some of the coefficients are really small. I asked Sofia about selecting only certain features. She said that cutting out features can add bias into the model, but there are feature selection methods in machine learning that we can use so that we aren't adding bias. Also, we could just look at the coefficients, see which ones are the smallest, and cut those out. This can sometimes make our model more accurate and enlarge coefficients that are being drowned out by meaningless variables. Sofia seemed really against this idea though, so maybe I'm wrong.
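One caveat on cutting the smallest coefficients: a coefficient's raw size depends on the units of its variable, so a tiny coefficient does not necessarily mean an unimportant feature. Below is a rough sketch with simulated data (the variables `x1` and `x2` and the helper `fit` are hypothetical, not from Rose.ipynb) showing how the ranking can flip once the features are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated features: x1 is measured in large units, x2 in small units.
x1 = rng.normal(0, 1000, n)
x2 = rng.normal(0, 1, n)
y = 0.002 * x1 + 0.5 * x2 + rng.normal(0, 0.1, n)


def fit(X, y):
    """Ordinary least squares with an intercept; returns only the slopes."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


X = np.column_stack([x1, x2])

# Coefficients on the raw scale: x1's looks tiny next to x2's.
raw = fit(X, y)

# Standardize each column (mean 0, standard deviation 1) and refit.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = fit(Xz, y)

print("raw coefficients:         ", raw)
print("standardized coefficients:", standardized)
```

On the raw scale `x1` has the smaller coefficient, but after standardizing it has the larger one, so it might be worth comparing coefficients on a common scale before deciding which variables to drop.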

I'll continue to look it over before our meeting on Monday so I can give better feedback but honestly I think you did a great job!

MargaretBassney commented 2 years ago

Okay, I looked over the code again and I still think it looks good.

For the country variable we could use dummy variables. I think we can do something like pd.get_dummies(df["economy"]), and you can wrap it in print to see what it does.

I still haven't looked into what exactly one-hot encoding does, which I should do soon, so I don't know if this way is the best option.
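For reference, pd.get_dummies is pandas's way of doing one-hot encoding: it turns one categorical column into one 0/1 column per category. A small sketch of what it produces, where the country names are made up and only the "economy" column name comes from our data:

```python
import pandas as pd

# Made-up values for illustration; only the column name "economy" is ours.
df = pd.DataFrame({"economy": ["Kenya", "Peru", "Kenya", "India"]})

# One 0/1 column per distinct value; each row has exactly one 1.
dummies = pd.get_dummies(df["economy"], prefix="economy", dtype=int)
print(dummies)

# drop_first=True leaves out the first category, which then acts as the
# reference level. This avoids perfect collinearity with the intercept
# in a linear model.
dummies_ref = pd.get_dummies(
    df["economy"], prefix="economy", drop_first=True, dtype=int
)
print(dummies_ref)

# Attach the dummy columns and drop the original text column.
model_df = pd.concat([df.drop(columns="economy"), dummies_ref], axis=1)
```

For a regression with an intercept, the drop_first=True version is probably the one we would want, but that is a judgment call worth checking with Prof. Cao.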

rporta23 commented 2 years ago

Thanks so much for the detailed feedback, @MargaretBassney! I am also going to have Prof. Cao look it over to see if we can resolve the couple of issues that we still have.