jhaskin / ds_portfolio

Data Science Projects

Final project quick presentation - advice #5

Closed masongallo closed 8 years ago

masongallo commented 8 years ago

What is the response variable you're trying to predict? Make sure you explicitly state it and define it in your slides. Did you check for outliers? If so, how? I am concerned about noise from using many different types of crimes. Should petty crimes be considered in the same "population" as violent crimes? I suspect not, so it might be more appropriate to focus on a particular category for your models.

jhaskin commented 8 years ago

Mason,

What were you doing writing to me at 2 in the morning??? Anyway, I have made some progress today, but maybe not to the conclusion we would like. Here is what I've done. Any suggestions would be appreciated.

Response (Crime) Variables: I've tried several.
- Crime_level: I rank each category 1-4 based on how violent/weather-related I think it is.
- Crime_count: just the number of incidents.
- CrimeOfPassion (COP): all the incidents whose descriptions contain any of about 25 keywords, like murder, rape, etc.
- violent_Count: only incidents from the assault, rape, and domestic violence categories.

I've tried them all, and so far I get the best scores from the simple count of incidents. I have a couple of ideas I'll mention after I talk about what else I've done.

Outliers: There was a problem with many reports filed on the first day of the month at 00:01. I removed those, but there were still a large number on the first of the month. I might want to just replace all the day 1 values with the mean?

Outliers in the crime count: I removed a few low and high ones. The lows were all on Christmas Day. I only removed about 15 out of 4747 records. May want to remove a few more.
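A minimal sketch of the two outlier steps described above, using pandas. The column name `timestamp`, the toy data, and the helper `iqr_outlier_mask` are all hypothetical. The thread does not say which rule was used to pick the low and high counts, so the 1.5 x IQR rule here is only one common choice, not necessarily the one applied.

```python
import pandas as pd

# Hypothetical incident-level data; the column name "timestamp" is an assumption.
incidents = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-01-01 00:01", "2015-01-01 00:01", "2015-01-01 14:30",
        "2015-01-02 09:15", "2015-01-02 22:40", "2015-02-01 00:01",
        "2015-02-01 18:05",
    ])
})

# Step 1: drop reports stamped 00:01 on the first day of a month,
# which look like placeholder times rather than real incident times.
ts = incidents["timestamp"]
is_placeholder = (ts.dt.day == 1) & (ts.dt.hour == 0) & (ts.dt.minute == 1)
cleaned = incidents[~is_placeholder]

# Step 2: aggregate to one incident count per calendar day.
daily_counts = cleaned.groupby(cleaned["timestamp"].dt.normalize()).size()


def iqr_outlier_mask(counts, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
    iqr = q3 - q1
    return (counts < q1 - k * iqr) | (counts > q3 + k * iqr)


# Toy daily counts to show the mask; the value 50 is the only one flagged.
example_counts = pd.Series([10, 11, 12, 11, 10, 50])
example_mask = iqr_outlier_mask(example_counts)
```

Flagging with a mask rather than deleting rows right away makes it easy to compare model scores with and without the suspect days.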

Started modeling with the crime_level variable. Features: went back to a minimum number of basic weather features (temp, humidity, pressure, etc.). Not a very good R2 score (.06). Adding dummies for day of week, it jumped to .13. Adding month dummies got it to .17. Adding windchill, apparent temp, and rain got it up to .18.
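For reference, one way the day-of-week and month dummies could be built with pandas. This is a sketch on a made-up daily table: the column names `date` and `temp` and the constant temperature value are placeholders, not the project's data.

```python
import pandas as pd

# Hypothetical daily table: one row per day, with a placeholder weather column.
daily = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=60, freq="D")})
daily["temp"] = 50.0  # placeholder value, stands in for the real weather features

# One-hot encode day of week and month. drop_first=True drops one level of each
# so the dummies are not perfectly collinear (7 weekdays -> 6 dummy columns).
dow = pd.get_dummies(daily["date"].dt.dayofweek, prefix="dow",
                     drop_first=True, dtype=float)
month = pd.get_dummies(daily["date"].dt.month, prefix="month",
                       drop_first=True, dtype=float)

# Final feature matrix: weather columns plus the calendar dummies.
features = pd.concat([daily[["temp"]], dow, month], axis=1)
```

With 60 days starting January 1 the table spans three months, so `features` ends up with 1 weather column, 6 weekday dummies, and 2 month dummies.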

Then tried the same setup on the simpler crime_count variable and it jumped up to .25. The other variables that tried to narrow the crimes to just the more violent/weather-related crimes did much worse, like .07 and .10.

This was done using Lasso regression. The SVM was similar, a point or so below, but Lasso was much faster to run, so I could try many more combos. I could also look at the coefficients: I could see that if I added too many correlated features (temp_max and temp_min), they would cancel each other.
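A quick way to spot correlated pairs like temp_max and temp_min before fitting is to scan the correlation matrix. The sketch below is an illustration on synthetic data: the helper `drop_highly_correlated` and the 0.9 threshold are assumptions, not something from the project.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic weather data: temp_min tracks temp_max closely, humidity does not.
rng = np.random.default_rng(0)
temp_max = rng.normal(70, 10, size=200)
weather = pd.DataFrame({
    "temp_max": temp_max,
    "temp_min": temp_max - 15 + rng.normal(0, 1, size=200),
    "humidity": rng.normal(60, 10, size=200),
})

reduced, dropped = drop_highly_correlated(weather)
```

Here `dropped` contains only `temp_min`, since it is nearly a shifted copy of `temp_max`. Which column of a pair to keep is a judgment call; this helper simply keeps whichever comes first.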

Thoughts: The day and month features are doing most of the work here. If I ran them alone I got .18, so the weather features only seem to add another .07 or so. (Also saw that a single weekend feature performed much worse than the 6 day-of-week dummies.) This might explain why narrowing the crimes down performed worse: if non-weather features were doing most of the work, then using only emotional/weather crimes just reduced the data a lot. Also realized that many crimes I did not pick may have weather/date relationships too. Things like speeding tickets could be affected. The police might also be affected by the weather and just give more tickets if they are hot and tired.
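The "date features alone vs. date plus weather" comparison can be made explicit by scoring both feature sets with the same cross-validation. The sketch below uses scikit-learn on synthetic data. The effect sizes, the `alpha=0.1` setting, the shuffled 5-fold split, and the helper `mean_cv_r2` are all assumptions chosen for illustration, so the printed scores say nothing about the real crime data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic daily data (assumed effect sizes): weekends add 20 incidents,
# each degree above 60 adds 0.5, plus random noise.
rng = np.random.default_rng(0)
frame = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=400, freq="D")})
temp = rng.normal(60, 10, size=len(frame))
is_weekend = (frame["date"].dt.dayofweek >= 5).to_numpy(dtype=float)
crime_count = (100 + 20 * is_weekend + 0.5 * (temp - 60)
               + rng.normal(0, 5, size=len(frame)))

# Calendar dummies only.
date_features = pd.concat([
    pd.get_dummies(frame["date"].dt.dayofweek, prefix="dow",
                   drop_first=True, dtype=float),
    pd.get_dummies(frame["date"].dt.month, prefix="month",
                   drop_first=True, dtype=float),
], axis=1).to_numpy()

# Calendar dummies plus one weather feature.
full_features = np.column_stack([date_features, temp])


def mean_cv_r2(X, y):
    """Mean 5-fold cross-validated R2 for a scaled Lasso model."""
    model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()


r2_date = mean_cv_r2(date_features, crime_count)
r2_full = mean_cv_r2(full_features, crime_count)
print(f"date only: {r2_date:.2f}, date + weather: {r2_full:.2f}")
```

The gap between the two scores is a rough measure of what the weather features add beyond the calendar, which is the quantity estimated informally above as about .07.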

To Try: Could remove a few more outliers. Could try taking out the worst-performing records and retrying, as you mentioned in your cheatsheet. Could try PCA on the best-performing model, but I tried it earlier and ended up cutting performance quite a bit. I've also tried to remove most of the highly correlated features on my own. I could go to the internet and see what I can find. I've been trying it on my own for the learning experience.

So the big question is: "Is an R2 score of .25 or .30 good enough for anything?" It might give some indication of whether the day will be better or worse than average, but probably not much better than the cops know from their own intuition.

If you have any other thoughts, please let me know. I'll send this to Lema also. Making a trip down for office hours might be too much for me.

Thanks

Jim

lemonsoup commented 8 years ago

Thanks Jim, I can see the thread here!
I'll defer to Mason for further modeling advice, but in terms of seeing whether weather impacts different types of crime in different ways, you could do a quick check such as a groupby on the crime type to see if any have significantly different mean temperature, humidity, etc. levels than the others. If they are all around the same, then that particular method of categorizing crime is probably not helpful. Since the impact of weather on crime may be subtle, you may need to dig deeper, such as adding features like a 5-day average temperature (maybe a week of heatwave has an impact?).
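Both suggestions are short in pandas. A rough sketch on made-up numbers follows; the column names `category`, `temp`, `humidity`, and `temp_5day_avg` are assumptions.

```python
import pandas as pd

# Hypothetical incident table with the weather at the time of each incident.
incidents = pd.DataFrame({
    "category": ["assault", "assault", "theft", "theft", "theft", "vandalism"],
    "temp": [80.0, 90.0, 60.0, 62.0, 64.0, 70.0],
    "humidity": [40.0, 50.0, 55.0, 60.0, 65.0, 45.0],
})

# Mean weather per crime type: large gaps between rows suggest the
# categories respond to weather differently.
by_type = incidents.groupby("category")[["temp", "humidity"]].mean()

# Hypothetical daily weather table for the rolling feature.
weather = pd.DataFrame({
    "date": pd.date_range("2015-07-01", periods=7, freq="D"),
    "temp": [70.0, 72.0, 74.0, 76.0, 78.0, 80.0, 82.0],
})

# 5-day trailing average temperature; the first four days have no full
# window and are left as NaN.
weather["temp_5day_avg"] = weather["temp"].rolling(window=5, min_periods=5).mean()
```

Because the window is trailing, each day's average uses only that day and the four before it, so the feature does not leak future weather into the model.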