The Titanic is one of the most famous ships in the world. The ship struck an iceberg and many people died. I obtained data on the passengers and investigated what helped people survive in such a situation.
Let's look at the data first.
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
The meaning of each variable is listed below:

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

from https://www.kaggle.com/c/titanic
There are 12 columns and 891 passengers. Age (714 non-null) and Cabin (204 non-null) have many missing values, and Embarked is missing 2.
Looking at the correlations among the numeric columns, Pclass, Age, and Fare appear to be correlated with Survived (the target).
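
As a rough illustration of that correlation check, the sketch below computes each numeric column's Pearson correlation with `Survived`. The rows are toy values invented for the example (not real passengers); only the column names follow the Kaggle data.

```python
import pandas as pd

# Toy rows with the same column names as the Kaggle data (NOT real passengers).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Age":      [22.0, 38.0, 26.0, 35.0, float("nan"), 54.0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 80.0, 7.9],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Keep only numeric columns, then read off the column of correlations with the target.
numeric = df.select_dtypes(include="number")
corr_with_target = numeric.corr()["Survived"].drop("Survived")
print(corr_with_target.sort_values())
```

Note that `corr()` ignores missing values pair by pair, so the Age correlation is computed only on rows where Age is known.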
Missing Age values are filled with the mean age. The test set also has a NaN in Fare, which is filled with the mean fare.
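
A minimal sketch of this mean imputation with pandas is shown here. The frames are toy stand-ins for the train and test sets. It assumes the means are computed on the training set and reused for the test set, which the text does not specify.

```python
import pandas as pd

# Toy frames standing in for the train and test sets (NOT real passengers).
train = pd.DataFrame({"Age": [22.0, float("nan"), 38.0], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Age": [30.0, float("nan")], "Fare": [float("nan"), 13.0]})

# Assumption: statistics come from the training set only.
age_mean = train["Age"].mean()    # NaN values are skipped by default
fare_mean = train["Fare"].mean()

train["Age"] = train["Age"].fillna(age_mean)
test["Age"] = test["Age"].fillna(age_mean)
test["Fare"] = test["Fare"].fillna(fare_mean)

print(train)
print(test)
```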
Here, the target value is 1 for passengers who survived and 0 for those who did not. A mean target value close to 1 for a group therefore means that passengers in that group were likely to survive.
For Pclass (the passenger class), passengers in class 1 had the highest survival rate and those in class 3 had the lowest.
For Embarked (the port where the passenger boarded), passengers who boarded at C had the highest survival rate and those who boarded at S had the lowest.
For Sex, women tended to survive more often than men.
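
The per-group survival rates described above are just the mean of `Survived` within each group. The sketch below shows one way to compute them with `groupby`. The rows are toy values made up for the example, so the printed rates are not the real Titanic figures.

```python
import pandas as pd

# Toy rows (NOT real passengers); column names follow the Kaggle data.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male", "male", "female", "male", "male"],
    "Embarked": ["C", "S", "C", "S", "S", "Q", "S", "C"],
})

# Because Survived is 0/1, its mean within a group is that group's survival rate.
survival_rate = {
    col: df.groupby(col)["Survived"].mean()
    for col in ["Pclass", "Sex", "Embarked"]
}

for col, rate in survival_rate.items():
    print(rate, end="\n\n")
```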
First, I aim for a model accuracy above 0.9.
I use "Pclass", "Sex", "SibSp", "Parch", "Embarked" and "Age" to build the model. For Age, I use "AgeBucket", which bins the age into 15-year intervals.
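
The exact binning rule is not stated, so the following is only one plausible sketch. It assumes each age is floored to the start of its 15-year interval (0, 15, 30, ...). The ages are toy values.

```python
import pandas as pd

# Toy ages (NOT real passengers).
df = pd.DataFrame({"Age": [4.0, 15.0, 29.0, 44.9, 62.0], "Survived": [1, 0, 1, 0, 0]})

# Floor each age to the lower edge of its 15-year bucket: 0, 15, 30, 45, 60, ...
df["AgeBucket"] = df["Age"] // 15 * 15

# Survival rate per bucket, computed the same way as for the other features.
print(df.groupby("AgeBucket")["Survived"].mean())
```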
The relationship between survival and age bucket is shown below.
For the model, I use RandomForestClassifier.
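
To make the setup concrete, here is a hedged sketch of training a RandomForestClassifier on these features with scikit-learn. The data is synthetic and the target is artificially tied to Sex. The one-hot encoding, the hyperparameters, and the 75/25 split are all assumptions of this example, not the settings actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the feature table (NOT real passengers).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "SibSp": rng.integers(0, 3, n),
    "Parch": rng.integers(0, 3, n),
    "Embarked": rng.choice(["C", "Q", "S"], n),
    "AgeBucket": rng.integers(0, 5, n) * 15,
})
# Synthetic target: mostly "female survives", with about 20% of labels flipped.
y = ((X["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

# One-hot encode the string columns; pass the numeric columns through unchanged.
categorical = ["Sex", "Embarked"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model.fit(X_train, y_train)
accuracy = model.score(X_valid, y_valid)
print(f"held-out accuracy: {accuracy:.3f}")
```

A variant without Embarked can be tried by dropping that column from `X` and from `categorical`.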
The accuracy of each model is shown in the table below. I also tried another model, based on an SVM that someone else built (GSSVM).

| | with embarked | without embarked | GSSVM |
| --- | --- | --- | --- |
| test accuracy | 0.821 | 0.827 | 0.815 |
| leaderboard score | 0.782 | 0.770 | 0.775 |

Judged by leaderboard score, the best model so far is the RandomForest with Embarked (0.782).
I also tried to improve the score of the RandomForest with Embarked using GridSearch. The best score was 0.840. However, it didn't improve the leaderboard score.
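
For reference, a grid search over a random forest can be sketched with scikit-learn's `GridSearchCV` as follows. The parameter grid, the 3-fold cross-validation, and the synthetic data from `make_classification` are assumptions of this example, not the search that produced 0.840.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data with 6 features (NOT the Titanic data).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Hypothetical grid; the values actually searched are not given in the text.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

`best_score_` is the mean cross-validated accuracy on the training data for the selected parameters, so it tends to be somewhat optimistic. That may be part of why a higher search score does not necessarily carry over to the leaderboard.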
The most important feature is Sex, the second is Pclass, and the third is AgeBucket.
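
A ranking like this can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which the target is deliberately driven by Sex, so the ordering it prints is true by construction and says nothing about the real data. The 0/1 encoding of Sex is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (NOT real passengers).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),             # assumed encoding: 0 = male, 1 = female
    "Pclass": rng.integers(1, 4, n),
    "AgeBucket": rng.integers(0, 5, n) * 15,
})
# Synthetic target: equals Sex, with about 10% of labels flipped.
y = X["Sex"].to_numpy() ^ (rng.random(n) < 0.1)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; larger means the feature was used for more useful splits.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

When categorical columns are one-hot encoded, their importance is split across the generated columns, so those values would have to be summed to compare whole features.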
If you want to survive an accident like the Titanic, you should be a rich, old woman!!
