Closed JunHyunPark01 closed 4 years ago
This appears to have been opened by mistake - closing. If this was intentional, please provide further details - preferably on a new issue - and, if you'd like to include your notebook file, please attach it to the issue (you'll need to adjust the suffix to get past the GH filters). We cannot help you with the information provided here.
{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": "# Applied Data Science Capstone Project\n### Ambarish Ambuj" }, { "cell_type": "markdown", "metadata": {}, "source": "## Introduction" }, { "cell_type": "markdown", "metadata": {}, "source": "The severity code of an accident is typically set such that it represents the extent of damage caused by the accident. In an environment of limited resources, focusing more resources on preventing high severity accidents is one way to minimize the amount of damage with the given resources. However, to do that, an understanding of the factors that affect the severity of an accident, and the extent to which they affect it, is essential. Hence, with the given data about accident severity and some related parameters, this project tries to come up with a model to predict the impact of some key parameters such as accident location type, collision type, weather condition, road condition, lighting condition, number of persons involved in the collision etc. on the severity of the accident. The output of this model can provide policy inputs to the government to take specific actions to mitigate the causes that impact the accident severity the most." }, { "cell_type": "markdown", "metadata": {}, "source": "## Data Description" }, { "cell_type": "markdown", "metadata": {}, "source": "The base data is taken from the example dataset provided in the course at the link https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv\n We will first import the data as a dataframe to get a glimpse of the data."
}, { "cell_type": "markdown", "metadata": {}, "source": "Importing the necessary python libraries for the project" }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": "import pandas as pd\nimport numpy as np\nimport matplotlib as mpl\nimport matplotlib.pyplot as plt\nfrom sklearn import preprocessing\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression as lr\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn import tree\nfrom sklearn import metrics\nfrom sklearn.metrics import jaccard_similarity_score, f1_score, log_loss" }, { "cell_type": "markdown", "metadata": {}, "source": "Importing the data from csv file" }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": "/opt/conda/envs/Python36/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (33) have mixed types. Specify dtype option on import or set low_memory=False.\n interactivity=interactivity, compiler=compiler, result=result)\n" } ], "source": "df = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')" }, { "cell_type": "markdown", "metadata": {}, "source": "Let us check the size of the file, the column names and some sample data to get basic understanding of the data." 
}, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": "(194673, 38)" }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": "df.shape" }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": "Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',\n 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',\n 'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',\n 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',\n 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',\n 'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',\n 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',\n 'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],\n dtype='object')" }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": "df.columns" }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>SEVERITYCODE</th>\n <th>X</th>\n <th>Y</th>\n <th>OBJECTID</th>\n <th>INCKEY</th>\n <th>COLDETKEY</th>\n <th>REPORTNO</th>\n <th>STATUS</th>\n <th>ADDRTYPE</th>\n <th>INTKEY</th>\n <th>...</th>\n <th>ROADCOND</th>\n <th>LIGHTCOND</th>\n <th>PEDROWNOTGRNT</th>\n <th>SDOTCOLNUM</th>\n <th>SPEEDING</th>\n <th>ST_COLCODE</th>\n <th>ST_COLDESC</th>\n <th>SEGLANEKEY</th>\n <th>CROSSWALKKEY</th>\n <th>HITPARKEDCAR</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2</td>\n <td>-122.323148</td>\n <td>47.703140</td>\n <td>1</td>\n <td>1307</td>\n <td>1307</td>\n <td>3502005</td>\n <td>Matched</td>\n 
<td>Intersection</td>\n <td>37475.0</td>\n <td>...</td>\n <td>Wet</td>\n <td>Daylight</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>10</td>\n <td>Entering at angle</td>\n <td>0</td>\n <td>0</td>\n <td>N</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>-122.347294</td>\n <td>47.647172</td>\n <td>2</td>\n <td>52200</td>\n <td>52200</td>\n <td>2607959</td>\n <td>Matched</td>\n <td>Block</td>\n <td>NaN</td>\n <td>...</td>\n <td>Wet</td>\n <td>Dark - Street Lights On</td>\n <td>NaN</td>\n <td>6354039.0</td>\n <td>NaN</td>\n <td>11</td>\n <td>From same direction - both going straight - bo...</td>\n <td>0</td>\n <td>0</td>\n <td>N</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n <td>-122.334540</td>\n <td>47.607871</td>\n <td>3</td>\n <td>26700</td>\n <td>26700</td>\n <td>1482393</td>\n <td>Matched</td>\n <td>Block</td>\n <td>NaN</td>\n <td>...</td>\n <td>Dry</td>\n <td>Daylight</td>\n <td>NaN</td>\n <td>4323031.0</td>\n <td>NaN</td>\n <td>32</td>\n <td>One parked--one moving</td>\n <td>0</td>\n <td>0</td>\n <td>N</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>-122.334803</td>\n <td>47.604803</td>\n <td>4</td>\n <td>1144</td>\n <td>1144</td>\n <td>3503937</td>\n <td>Matched</td>\n <td>Block</td>\n <td>NaN</td>\n <td>...</td>\n <td>Dry</td>\n <td>Daylight</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>23</td>\n <td>From same direction - all others</td>\n <td>0</td>\n <td>0</td>\n <td>N</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2</td>\n <td>-122.306426</td>\n <td>47.545739</td>\n <td>5</td>\n <td>17700</td>\n <td>17700</td>\n <td>1807429</td>\n <td>Matched</td>\n <td>Intersection</td>\n <td>34387.0</td>\n <td>...</td>\n <td>Wet</td>\n <td>Daylight</td>\n <td>NaN</td>\n <td>4028032.0</td>\n <td>NaN</td>\n <td>10</td>\n <td>Entering at angle</td>\n <td>0</td>\n <td>0</td>\n <td>N</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows \u00d7 38 columns</p>\n</div>", "text/plain": " SEVERITYCODE X Y OBJECTID INCKEY COLDETKEY REPORTNO \\\n0 2 -122.323148 47.703140 1 
1307 1307 3502005 \n1 1 -122.347294 47.647172 2 52200 52200 2607959 \n2 1 -122.334540 47.607871 3 26700 26700 1482393 \n3 1 -122.334803 47.604803 4 1144 1144 3503937 \n4 2 -122.306426 47.545739 5 17700 17700 1807429 \n\n STATUS ADDRTYPE INTKEY ... ROADCOND LIGHTCOND \\\n0 Matched Intersection 37475.0 ... Wet Daylight \n1 Matched Block NaN ... Wet Dark - Street Lights On \n2 Matched Block NaN ... Dry Daylight \n3 Matched Block NaN ... Dry Daylight \n4 Matched Intersection 34387.0 ... Wet Daylight \n\n PEDROWNOTGRNT SDOTCOLNUM SPEEDING ST_COLCODE \\\n0 NaN NaN NaN 10 \n1 NaN 6354039.0 NaN 11 \n2 NaN 4323031.0 NaN 32 \n3 NaN NaN NaN 23 \n4 NaN 4028032.0 NaN 10 \n\n ST_COLDESC SEGLANEKEY \\\n0 Entering at angle 0 \n1 From same direction - both going straight - bo... 0 \n2 One parked--one moving 0 \n3 From same direction - all others 0 \n4 Entering at angle 0 \n\n CROSSWALKKEY HITPARKEDCAR \n0 0 N \n1 0 N \n2 0 N \n3 0 N \n4 0 N \n\n[5 rows x 38 columns]" }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": "df.head()" }, { "cell_type": "markdown", "metadata": {}, "source": "So, there are 194673 observations of incidents. There are 38 columns in the original dataset but, as the column listing shows, the 'SEVERITYCODE' column is repeated (as 'SEVERITYCODE.1'). So, there are 37 attributes for 194673 incidents. However, going back to our problem definition, not all 37 attributes are of interest to us. We are only interested in exploring the impact of certain mitigable attributes on the severity of the accident. So, based on the primary theoretical understanding, we select 'SEVERITYCODE' as the dependent variable and the following variables as independent (predictor) variables: \n1. 'ADDRTYPE': A categorical variable representing the type of location where the incident took place. It may take the values of 'Intersection', 'Block' etc. \n2. 'COLLISIONTYPE': A categorical variable indicating the type of collision such as head-on, angle etc.\n3. 
'PERSONCOUNT': An integer representing number of persons involved in the collision.\n4. 'PEDCOUNT': An integer representing number of pedestrians involved in the collision.\n5. 'PEDCYLCOUNT': An integer representing the number of bicycles involved in the collision.\n6. 'VEHCOUNT': An integer representing the number of vehicles involved in the collision.\n7. 'WEATHER': A categorical variable describing whether the weather was cloudy or rainy etc. at the time of collision\n8. 'ROADCOND': A categorical variable describing condition of the road i.e. dry or wet\n9. 'LIGHTCOND': A categorical variable describing the lighting condition at the time of collision." }, { "cell_type": "markdown", "metadata": {}, "source": "So, let's extract the one target and 9 predictor variables from the dataframe and store it in a new dataframe." }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>SEVERITYCODE</th>\n <th>ADDRTYPE</th>\n <th>COLLISIONTYPE</th>\n <th>PERSONCOUNT</th>\n <th>PEDCOUNT</th>\n <th>PEDCYLCOUNT</th>\n <th>VEHCOUNT</th>\n <th>WEATHER</th>\n <th>ROADCOND</th>\n <th>LIGHTCOND</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2</td>\n <td>Intersection</td>\n <td>Angles</td>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>Overcast</td>\n <td>Wet</td>\n <td>Daylight</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>Block</td>\n <td>Sideswipe</td>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>Raining</td>\n <td>Wet</td>\n <td>Dark - Street Lights On</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n <td>Block</td>\n <td>Parked Car</td>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n 
<td>3</td>\n <td>Overcast</td>\n <td>Dry</td>\n <td>Daylight</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>Block</td>\n <td>Other</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>3</td>\n <td>Clear</td>\n <td>Dry</td>\n <td>Daylight</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2</td>\n <td>Intersection</td>\n <td>Angles</td>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>Raining</td>\n <td>Wet</td>\n <td>Daylight</td>\n </tr>\n </tbody>\n</table>\n</div>", "text/plain": " SEVERITYCODE ADDRTYPE COLLISIONTYPE PERSONCOUNT PEDCOUNT \\\n0 2 Intersection Angles 2 0 \n1 1 Block Sideswipe 2 0 \n2 1 Block Parked Car 4 0 \n3 1 Block Other 3 0 \n4 2 Intersection Angles 2 0 \n\n PEDCYLCOUNT VEHCOUNT WEATHER ROADCOND LIGHTCOND \n0 0 2 Overcast Wet Daylight \n1 0 2 Raining Wet Dark - Street Lights On \n2 0 3 Overcast Dry Daylight \n3 0 3 Clear Dry Daylight \n4 0 2 Raining Wet Daylight " }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": "df1 = df[[\"SEVERITYCODE\", \"ADDRTYPE\", \"COLLISIONTYPE\", \"PERSONCOUNT\", \"PEDCOUNT\", \"PEDCYLCOUNT\", \"VEHCOUNT\",\"WEATHER\", \"ROADCOND\", \"LIGHTCOND\"]]\ndf1.head()" }, { "cell_type": "markdown", "metadata": {}, "source": "As we have a large amount of data available, a straightforward treatment of missing values is to drop the affected rows, so that we do not have to guess the missing values and thereby distort the output. So, we will drop any rows with missing values." }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": "(187504, 10)" }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": "df2 = df1.dropna(axis=0)\ndf2.shape" }, { "cell_type": "markdown", "metadata": {}, "source": "So, 7169 of the 194673 rows (about 3.7%) were dropped and we now have 187504 observations to train, validate and test the model. 
" }, { "cell_type": "markdown", "metadata": {}, "source": "## Exploratory Data Analysis" }, { "cell_type": "markdown", "metadata": {}, "source": "We will first explore the data and try to observe some patterns within the data which may further help in our analysis exercise." }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": "1 130634\n2 56870\nName: SEVERITYCODE, dtype: int64" }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['SEVERITYCODE'].value_counts()" }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": "[]" }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax = df2['SEVERITYCODE'].value_counts().plot(kind = 'bar')\ndef add_value_labels(ax, spacing=5, fontsize = 14):\n\n for rect in ax.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.1%}\".format(y_value/187504)\n ax.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', va='bottom')\n\nadd_value_labels(ax)\nax.spines['top'].set_visible(False)\nax.spines['left'].set_visible(False)\nax.spines['right'].set_visible(False)\nax.get_yaxis().set_ticks([])" }, { "cell_type": "markdown", "metadata": {}, "source": "As is evident from the bar chart, 69.7% of all accidents were of severity 1, i.e. property damage only, whereas the remaining 30.3% also resulted in some human injury. \nThis is in line with expectations, as less severe accidents are more common." }, { "cell_type": "markdown", "metadata": {}, "source": "Let us now analyze some of the explanatory variables considered in isolation."
}, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": "Block 123315\nIntersection 63447\nAlley 742\nName: ADDRTYPE, dtype: int64" }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['ADDRTYPE'].value_counts()" }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": "[]" }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax2 = df2['ADDRTYPE'].value_counts().plot(kind='bar')\ndef add_value_labels(ax2, spacing=5, fontsize = 14):\n\n for rect in ax2.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.1%}\".format(y_value/187504)\n ax2.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', 
va='bottom')\n\nadd_value_labels(ax2)\nax2.spines['top'].set_visible(False)\nax2.spines['left'].set_visible(False)\nax2.spines['right'].set_visible(False)\nax2.get_yaxis().set_ticks([])" }, { "cell_type": "markdown", "metadata": {}, "source": "The chart indicates that about two thirds (65.8%) of all accidents took place in blocks, whereas about one third (33.8%) took place at intersections. Alleys, understandably, contributed a negligible proportion (0.4%) of accidents." }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": "Parked Car 46679\nAngles 34555\nRear Ended 33794\nOther 23440\nSideswipe 18442\nLeft Turn 13659\nPedestrian 6589\nCycles 5399\nRight Turn 2936\nHead On 2011\nName: COLLISIONTYPE, dtype: int64" }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['COLLISIONTYPE'].value_counts()" }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax3 = df2['COLLISIONTYPE'].value_counts().plot(kind='bar')\ndef add_value_labels(ax3, spacing=5, fontsize = 14):\n\n for rect in ax3.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.0%}\".format(y_value/187504)\n ax3.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', va='bottom')\n\nadd_value_labels(ax3)\nax3.spines['top'].set_visible(False)\nax3.spines['right'].set_visible(False)" }, { "cell_type": "markdown", "metadata": {}, "source": "This bar chart provides some very interesting insights. It indicates that about a quarter of all accidents involved a parked car. It is likely that these incidents mostly happen in blocks rather than at intersections. This may be one plausible reason why blocks have more accidents than intersections. 
\nThis also provides an interesting policy question to address and regulate the parking in blocks to avoid these accidents. \n'Angles', 'Rear Ended' and 'Sideswipe' are other prominent types of accidents. \nFortunately, the number of accidents with 'Head On' collision is low." }, { "cell_type": "markdown", "metadata": {}, "source": "Looking at the accident location and collision type in isolation itself has given us significant insights. However, looking at them together may provide us further insights. So, we create a pivot table and a heatmap to understand it better." }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": "df2group = df2[['SEVERITYCODE','ADDRTYPE','COLLISIONTYPE']].groupby(['ADDRTYPE','COLLISIONTYPE'], as_index = False).count()" }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"10\" halign=\"left\">SEVERITYCODE</th>\n </tr>\n <tr>\n <th>COLLISIONTYPE</th>\n <th>Angles</th>\n <th>Cycles</th>\n <th>Head On</th>\n <th>Left Turn</th>\n <th>Other</th>\n <th>Parked Car</th>\n <th>Pedestrian</th>\n <th>Rear Ended</th>\n <th>Right Turn</th>\n <th>Sideswipe</th>\n </tr>\n <tr>\n <th>ADDRTYPE</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Alley</th>\n <td>57.0</td>\n <td>8.0</td>\n <td>4.0</td>\n <td>NaN</td>\n <td>284.0</td>\n <td>325.0</td>\n <td>37.0</td>\n <td>11.0</td>\n <td>NaN</td>\n <td>16.0</td>\n </tr>\n <tr>\n <th>Block</th>\n <td>5653.0</td>\n <td>2298.0</td>\n <td>1567.0</td>\n 
<td>2114.0</td>\n <td>19416.0</td>\n <td>45057.0</td>\n <td>1856.0</td>\n <td>29595.0</td>\n <td>1226.0</td>\n <td>14533.0</td>\n </tr>\n <tr>\n <th>Intersection</th>\n <td>28845.0</td>\n <td>3093.0</td>\n <td>440.0</td>\n <td>11545.0</td>\n <td>3740.0</td>\n <td>1297.0</td>\n <td>4696.0</td>\n <td>4188.0</td>\n <td>1710.0</td>\n <td>3893.0</td>\n </tr>\n </tbody>\n</table>\n</div>", "text/plain": " SEVERITYCODE \\\nCOLLISIONTYPE Angles Cycles Head On Left Turn Other Parked Car \nADDRTYPE \nAlley 57.0 8.0 4.0 NaN 284.0 325.0 \nBlock 5653.0 2298.0 1567.0 2114.0 19416.0 45057.0 \nIntersection 28845.0 3093.0 440.0 11545.0 3740.0 1297.0 \n\n \nCOLLISIONTYPE Pedestrian Rear Ended Right Turn Sideswipe \nADDRTYPE \nAlley 37.0 11.0 NaN 16.0 \nBlock 1856.0 29595.0 1226.0 14533.0 \nIntersection 4696.0 4188.0 1710.0 3893.0 " }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": "df2_pivot = df2group.pivot(index='ADDRTYPE', columns = 'COLLISIONTYPE')\ndf2_pivot" }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 2 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "plt.pcolor(df2_pivot, cmap = 'RdBu')\nplt.colorbar()\nplt.yticks([0.5, 1.5, 2.5],['Alley','Block','Intersection'])\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": "The pivot table and the heatmap confirm our previous hypothesis that parked cars contribute a significant proportion of the accidents reported in blocks. Other prominent combinations are 'angle collision at intersections', 'sideswipe collision in blocks' and 'left turn collision at intersections'. These four combinations together account for about 53% of the accidents (99980 of 187504), so addressing them systematically could reduce a large share of accidents."
}, { "cell_type": "markdown", "metadata": {}, "source": "Now that we have explored the location of incident and type of collision, let us move on to environmental factors potentially affecting the accidents. We will specifically look at three factors for which we have data available in this dataset, namely weather, road condition and light condition." }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": "Clear 110493\nRaining 32969\nOvercast 27545\nUnknown 14057\nSnowing 896\nOther 790\nFog/Smog/Smoke 563\nSleet/Hail/Freezing Rain 112\nBlowing Sand/Dirt 49\nSevere Crosswind 25\nPartly Cloudy 5\nName: WEATHER, dtype: int64" }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['WEATHER'].value_counts()" }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax4 = df2['WEATHER'].value_counts().plot(kind='bar')\ndef add_value_labels(ax4, spacing=5, fontsize = 14):\n\n for rect in ax4.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.0%}\".format(y_value/187504)\n ax4.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', va='bottom')\n\nadd_value_labels(ax4)\nax4.spines['top'].set_visible(False)\nax4.spines['right'].set_visible(False)" }, { "cell_type": "markdown", "metadata": {}, "source": "This chart indicates that 59% of the accidents took place on clear days, 18% on rainy days and 15% on overcast days. Let us explore whether the weather had any significant effect on the severity of the accidents." 
}, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"2\" halign=\"left\">COLLISIONTYPE</th>\n </tr>\n <tr>\n <th>SEVERITYCODE</th>\n <th>1</th>\n <th>2</th>\n </tr>\n <tr>\n <th>WEATHER</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Blowing Sand/Dirt</th>\n <td>36</td>\n <td>13</td>\n </tr>\n <tr>\n <th>Clear</th>\n <td>74775</td>\n <td>35718</td>\n </tr>\n <tr>\n <th>Fog/Smog/Smoke</th>\n <td>377</td>\n <td>186</td>\n </tr>\n <tr>\n <th>Other</th>\n <td>676</td>\n <td>114</td>\n </tr>\n <tr>\n <th>Overcast</th>\n <td>18834</td>\n <td>8711</td>\n </tr>\n <tr>\n <th>Partly Cloudy</th>\n <td>2</td>\n <td>3</td>\n </tr>\n <tr>\n <th>Raining</th>\n <td>21835</td>\n <td>11134</td>\n </tr>\n <tr>\n <th>Severe Crosswind</th>\n <td>18</td>\n <td>7</td>\n </tr>\n <tr>\n <th>Sleet/Hail/Freezing Rain</th>\n <td>85</td>\n <td>27</td>\n </tr>\n <tr>\n <th>Snowing</th>\n <td>729</td>\n <td>167</td>\n </tr>\n <tr>\n <th>Unknown</th>\n <td>13267</td>\n <td>790</td>\n </tr>\n </tbody>\n</table>\n</div>", "text/plain": " COLLISIONTYPE \nSEVERITYCODE 1 2\nWEATHER \nBlowing Sand/Dirt 36 13\nClear 74775 35718\nFog/Smog/Smoke 377 186\nOther 676 114\nOvercast 18834 8711\nPartly Cloudy 2 3\nRaining 21835 11134\nSevere Crosswind 18 7\nSleet/Hail/Freezing Rain 85 27\nSnowing 729 167\nUnknown 13267 790" }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": "df2group2 = df2[['SEVERITYCODE','WEATHER','COLLISIONTYPE']].groupby(['WEATHER','SEVERITYCODE'],as_index= False).count()\ndf2pivot2 = df2group2.pivot(index='WEATHER', 
columns = 'SEVERITYCODE')\ndf2pivot2" }, { "cell_type": "markdown", "metadata": {}, "source": "On clear, rainy and overcast days, the ratio of severity 1 to severity 2 accidents appears to be similar (roughly two to one in each case), indicating that the weather may not have a significant effect on the severity of accidents." }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": "Dry 123730\nWet 47213\nUnknown 14005\nIce 1192\nSnow/Slush 992\nOther 124\nStanding Water 111\nSand/Mud/Dirt 73\nOil 64\nName: ROADCOND, dtype: int64" }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['ROADCOND'].value_counts()" }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax5 = df2['ROADCOND'].value_counts().plot(kind='bar')\ndef add_value_labels(ax5, spacing=5, fontsize = 14):\n\n for rect in ax5.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.0%}\".format(y_value/187504)\n ax5.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', va='bottom')\n\nadd_value_labels(ax5)\nax5.spines['top'].set_visible(False)\nax5.spines['right'].set_visible(False)" }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"2\" halign=\"left\">COLLISIONTYPE</th>\n </tr>\n <tr>\n <th>SEVERITYCODE</th>\n <th>1</th>\n <th>2</th>\n </tr>\n 
<tr>\n <th>ROADCOND</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Dry</th>\n <td>83832</td>\n <td>39898</td>\n </tr>\n <tr>\n <th>Ice</th>\n <td>923</td>\n <td>269</td>\n </tr>\n <tr>\n <th>Oil</th>\n <td>40</td>\n <td>24</td>\n </tr>\n <tr>\n <th>Other</th>\n <td>82</td>\n <td>42</td>\n </tr>\n <tr>\n <th>Sand/Mud/Dirt</th>\n <td>51</td>\n <td>22</td>\n </tr>\n <tr>\n <th>Snow/Slush</th>\n <td>827</td>\n <td>165</td>\n </tr>\n <tr>\n <th>Standing Water</th>\n <td>82</td>\n <td>29</td>\n </tr>\n <tr>\n <th>Unknown</th>\n <td>13276</td>\n <td>729</td>\n </tr>\n <tr>\n <th>Wet</th>\n <td>31521</td>\n <td>15692</td>\n </tr>\n </tbody>\n</table>\n</div>", "text/plain": " COLLISIONTYPE \nSEVERITYCODE 1 2\nROADCOND \nDry 83832 39898\nIce 923 269\nOil 40 24\nOther 82 42\nSand/Mud/Dirt 51 22\nSnow/Slush 827 165\nStanding Water 82 29\nUnknown 13276 729\nWet 31521 15692" }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": "df2group3 = df2[['SEVERITYCODE','ROADCOND','COLLISIONTYPE']].groupby(['ROADCOND','SEVERITYCODE'],as_index= False).count()\ndf2pivot3 = df2group3.pivot(index='ROADCOND', columns = 'SEVERITYCODE')\ndf2pivot3" }, { "cell_type": "markdown", "metadata": {}, "source": "The proportion of wet among severity 2 accidents is slightly higher than the proportion of wet among all accidents. This may indicate a role of wet roads in increasing the severity of the accident which can be further evaluated using machine learning models. 
" }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": "Daylight 115395\nDark - Street Lights On 48233\nUnknown 12597\nDusk 5842\nDawn 2490\nDark - No Street Lights 1525\nDark - Street Lights Off 1184\nOther 227\nDark - Unknown Lighting 11\nName: LIGHTCOND, dtype: int64" }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": "df2['LIGHTCOND'].value_counts()" }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": "<Figure size 432x288 with 1 Axes>" }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": "ax6 = df2['LIGHTCOND'].value_counts().plot(kind='bar')\ndef add_value_labels(ax6, spacing=5, fontsize = 14):\n\n for rect in ax6.patches:\n y_value = rect.get_height()\n x_value = rect.get_x() + rect.get_width() / 2\n label = \"{:.0%}\".format(y_value/187504)\n ax6.annotate(label, (x_value, y_value), xytext = (0, spacing), textcoords = \"offset points\", ha = 'center', va='bottom')\n\nadd_value_labels(ax6)\nax6.spines['top'].set_visible(False)\nax6.spines['right'].set_visible(False)" }, { "cell_type": "markdown", "metadata": {}, "source": "Lights do not seem to be a problem as most of the accidents occurred in daylight or with street lights on. Let's see a breakdown of accident severity for various light conditions." 
}, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"2\" halign=\"left\">COLLISIONTYPE</th>\n </tr>\n <tr>\n <th>SEVERITYCODE</th>\n <th>1</th>\n <th>2</th>\n </tr>\n <tr>\n <th>LIGHTCOND</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Dark - No Street Lights</th>\n <td>1191</td>\n <td>334</td>\n </tr>\n <tr>\n <th>Dark - Street Lights Off</th>\n <td>869</td>\n <td>315</td>\n </tr>\n <tr>\n <th>Dark - Street Lights On</th>\n <td>33816</td>\n <td>14417</td>\n </tr>\n <tr>\n <th>Dark - Unknown Lighting</th>\n <td>7</td>\n <td>4</td>\n </tr>\n <tr>\n <th>Dawn</th>\n <td>1667</td>\n <td>823</td>\n </tr>\n <tr>\n <th>Daylight</th>\n <td>76995</td>\n <td>38400</td>\n </tr>\n <tr>\n <th>Dusk</th>\n <td>3906</td>\n <td>1936</td>\n </tr>\n <tr>\n <th>Other</th>\n <td>175</td>\n <td>52</td>\n </tr>\n <tr>\n <th>Unknown</th>\n <td>12008</td>\n <td>589</td>\n </tr>\n </tbody>\n</table>\n</div>", "text/plain": " COLLISIONTYPE \nSEVERITYCODE 1 2\nLIGHTCOND \nDark - No Street Lights 1191 334\nDark - Street Lights Off 869 315\nDark - Street Lights On 33816 14417\nDark - Unknown Lighting 7 4\nDawn 1667 823\nDaylight 76995 38400\nDusk 3906 1936\nOther 175 52\nUnknown 12008 589" }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": "df2group4 = df2[['SEVERITYCODE','LIGHTCOND','COLLISIONTYPE']].groupby(['LIGHTCOND','SEVERITYCODE'],as_index= False).count()\ndf2pivot4 = df2group4.pivot(index='LIGHTCOND', columns = 'SEVERITYCODE')\ndf2pivot4" }, { "cell_type": "markdown", "metadata": {}, "source": "The data does not point 
to any obvious relation between light condition and accident severity." }, { "cell_type": "markdown", "metadata": {}, "source": "## Methodology" }, { "cell_type": "markdown", "metadata": {}, "source": "To explore the predictability of accident severity based on the selected explanatory variables, we will train a classifier model on the data. \nWe will take the following steps to arrive at a classifier model: \n1. First, we will create dummy variables for the categorical variables.\n2. Next, we will split the available dataset into training, cross-validation and test sets. The training data will be used to train the models, the cross-validation data will be used to fine-tune the model by adjusting certain parameters, and the test data will be used to evaluate the performance of the models.\n3. Finally, we will fit a logistic regression model to the data and evaluate its accuracy.\n" }, { "cell_type": "markdown", "metadata": {}, "source": "## Results" }, { "cell_type": "markdown", "metadata": {}, "source": "### Data Pre-processing" }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": "/opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype uint8, float64 were all converted to float64 by StandardScaler.\n return self.partial_fit(X, y)\n/opt/conda/envs/Python36/lib/python3.6/site-packages/ipykernel/__main__.py:20: DataConversionWarning: Data with input dtype uint8, float64 were all converted to float64 by StandardScaler.\n" } ], "source": "# Getting the explanatory variables\ndf3 = df2.iloc[:,1:10]\n# Getting the dependent (target) variable\ny = df2['SEVERITYCODE'].values\n\n# Creating dummy variables\ndf4 = pd.concat([df3,pd.get_dummies(df['ADDRTYPE'])], axis=1)\ndf4.drop(['ADDRTYPE','Alley'], axis = 1, inplace = True)\ndf5 = pd.concat([df4,pd.get_dummies(df['COLLISIONTYPE'])], axis=1)\ndf5.drop(['COLLISIONTYPE','Other'], 
axis = 1, inplace = True)\ndf6 = pd.concat([df5,pd.get_dummies(df['WEATHER'])], axis=1)\ndf6.drop(['WEATHER','Other'], axis = 1, inplace = True)\ndf7 = pd.concat([df6,pd.get_dummies(df['ROADCOND'])], axis=1)\ndf7.drop(['ROADCOND','Other'], axis = 1, inplace = True)\ndf8 = pd.concat([df7,pd.get_dummies(df['LIGHTCOND'])], axis=1)\ndf8.drop(['LIGHTCOND','Other','Unknown'], axis = 1, inplace = True)\ndf8.dropna(axis=0, inplace = True)\n\n# Feature Scaling\nX= preprocessing.StandardScaler().fit(df8).transform(df8)" }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "Train set: (112502, 38) (112502,)\nCross validation set: (37501, 38) (37501,)\nTest set: (37501, 38) (37501,)\n" } ], "source": "# We will split the dataset in two steps: first carving out a 20% test set, then carving out 25% of the remainder (20% of the full dataset) as a cross-validation set\nX_temp, X_test, y_temp, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\nX_train, X_cv, y_train, y_cv = train_test_split( X_temp, y_temp, test_size=0.25, random_state=10)\nprint ('Train set:', X_train.shape, y_train.shape)\nprint ('Cross validation set:', X_cv.shape, y_cv.shape)\nprint ('Test set:', X_test.shape, y_test.shape)" }, { "cell_type": "markdown", "metadata": {}, "source": "### Logistic Regression" }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": "array([0.7549932 , 0.75496653, 0.75509986, 0.75512653, 0.7550732 ])" }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": "lr_acc = np.zeros(5)\nC = [0.01, 0.03, 0.1, 0.3, 1]\nfor n in range(1,6):\n c = C[n-1]\n LR = lr(C= c, solver='liblinear').fit(X_train,y_train)\n yhat = LR.predict(X_cv)\n yhat_prob = LR.predict_proba(X_cv)\n lr_acc[n-1] = metrics.accuracy_score(y_cv, yhat)\n\nlr_acc" }, { "cell_type": "markdown", "metadata": {}, "source": "Although the accuracy is not affected significantly by 
changing C, the best accuracy was at C = 0.3. So, let's use that model." }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": "LR3 = lr(C = 0.3, solver = 'liblinear').fit(X_train, y_train)" }, { "cell_type": "markdown", "metadata": {}, "source": "We will evaluate the model on the test data using three metrics: Jaccard similarity score, F1 score and log loss." }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "The F1 score is 0.717796436024343 , the Jaccard Similarity Score is 0.7573398042718861 and the log loss score is 0.4811901425339784\n" } ], "source": "# Evaluate LR3, the model selected on the cross-validation set\nyhat = LR3.predict(X_test)\nyhat_prob = LR3.predict_proba(X_test)\nf1sl = f1_score(y_test, yhat, average='weighted') \njssl = jaccard_similarity_score(y_test, yhat)\nlogloss = log_loss(y_test, yhat_prob)\nprint(\"The F1 score is \", f1sl, \", the Jaccard Similarity Score is \", jssl, \" and the log loss score is \", logloss)" }, { "cell_type": "markdown", "metadata": {}, "source": "With a weighted F1 score of about 0.72, a Jaccard similarity score of about 0.76 and a log loss of about 0.48, the model performs reasonably well in predicting the severity of accidents from the selected variables." }, { "cell_type": "markdown", "metadata": {}, "source": "## Discussion" }, { "cell_type": "markdown", "metadata": {}, "source": "Let's print the predictor variables and the coefficients of this model."
}, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": "The regression results are in the table below:\n Predictors Coefficient\n0 PERSONCOUNT 0.196347\n1 PEDCOUNT 0.503117\n2 PEDCYLCOUNT 0.557418\n3 VEHCOUNT 0.177225\n4 Block 0.384537\n5 Intersection 0.464946\n6 Angles 0.069204\n7 Cycles -0.034651\n8 Head On 0.049831\n9 Left Turn 0.044486\n10 Parked Car -0.807155\n11 Pedestrian 0.139716\n12 Rear Ended 0.153727\n13 Right Turn -0.084374\n14 Sideswipe -0.337537\n15 Blowing Sand/Dirt 0.002608\n16 Clear 0.172218\n17 Fog/Smog/Smoke 0.017697\n18 Overcast 0.105056\n19 Partly Cloudy 0.014706\n20 Raining 0.104690\n21 Severe Crosswind 0.002489\n22 Sleet/Hail/Freezing Rain -0.011673\n23 Snowing 0.015935\n24 Dry 0.303440\n25 Ice 0.042644\n26 Oil 0.024158\n27 Sand/Mud/Dirt 0.019980\n28 Snow/Slush 0.020552\n29 Standing Water 0.012825\n30 Wet 0.276915\n31 Dark - No Street Lights 0.038395\n32 Dark - Street Lights Off 0.047863\n33 Dark - Street Lights On 0.264161\n34 Dark - Unknown Lighting -0.002439\n35 Dawn 0.073487\n36 Daylight 0.289292\n37 Dusk 0.114966\n" } ], "source": "X_var = df8.columns.tolist()\ncoeff = LR3.coef_.tolist()\ndfcoef= pd.DataFrame(coeff)\ndfcoeff = dfcoef.transpose()\ndfcoeff.rename(columns = {0:'Coefficient'}, inplace = True)\ndfvar = pd.DataFrame(X_var)\ndfvar.rename(columns = {0:'Predictors'}, inplace = True)\nRegResult = pd.concat([dfvar,dfcoeff], axis =1)\nprint(\"The regression results are in the table below:\")\nprint(RegResult)" }, { "cell_type": "markdown", "metadata": {}, "source": "It is worth noting that the model treats severity 1 as the '0' class and severity 2 as the '1' class. So, variables with positive coefficients are associated with severity 2 accidents, whereas variables with negative coefficients are associated with severity 1 accidents. 
\nIt follows from the table above that 'Parked Car' and 'Sideswipe' collisions have the strongest association with severity 1 accidents. This confirms our hypothesis based on exploratory data analysis. \nThe counts of persons, pedestrians, cyclists and vehicles involved are all positively associated with severity 2 accidents, i.e. the more persons or vehicles involved in an accident, the higher the chance of it being a severity 2 accident. However, the number of persons and the number of vehicles have much lower coefficients than the number of pedestrians and the number of cyclists. So, accidents involving pedestrians and cyclists are more likely to be severity 2 accidents than accidents involving vehicles. \n'Overcast' and 'Raining' weather conditions have positive coefficients, indicating an association with severity 2 accidents, and the positive coefficient for the 'Wet' road condition is consistent with this, although 'Clear' weather and 'Dry' roads have comparable or larger positive coefficients. \nAmong light conditions, 'Daylight' and 'Dark - Street Lights On' have the highest positive coefficients, indicating that poor lighting had no significant impact in increasing the severity of the accident." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "" } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 1 }
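As a rough illustration of the coefficient discussion in the notebook, the sketch below converts a few of the reported coefficients into odds ratios. It assumes the coefficients are log-odds from the logistic regression and that, because the features were standardized with `StandardScaler`, each ratio describes a one-standard-deviation increase in that feature. The helper name `odds_ratio` and the choice of predictors are only for illustration and are not part of the notebook.

```python
import math

# A subset of the coefficients printed in the regression table of the notebook.
coefficients = {
    "PEDCOUNT": 0.503117,
    "PEDCYLCOUNT": 0.557418,
    "Parked Car": -0.807155,
    "Sideswipe": -0.337537,
    "Wet": 0.276915,
}


def odds_ratio(coef):
    """Hypothetical helper: turn a log-odds coefficient into an odds ratio."""
    return math.exp(coef)


# Assumption: the features were standardized, so each ratio is the multiplicative
# change in the odds of a severity 2 accident per +1 standard deviation.
odds_ratios = {name: odds_ratio(coef) for name, coef in coefficients.items()}

# Print from the strongest severity 1 association to the strongest severity 2 one.
for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print(f"{name:12s} odds ratio per +1 SD: {ratio:.2f}")
```

Ratios below 1 (for example 'Parked Car') point toward severity 1 and ratios above 1 toward severity 2. Because scikit-learn's logistic regression is regularized by default, these values are indicative rather than exact effect sizes.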