ronojoy / pylearn

Bayesian machine learning in Python

Titanic: How to use the "Name" attribute #2

Open yeskarthik opened 9 years ago

yeskarthik commented 9 years ago

Intuitively, the name does not seem to be directly related to a person's survival, but how else can it be used? Can it be used when one of age or sex is missing, so that we estimate that attribute from the name?

ronojoy commented 9 years ago

@yeskarthik, your question is a good one and its answer shows how to handle the missing data problem using Bayesian networks. Compare the state of the network in titanic, where all the nodes are observed (and hence grayed out), with the state of the network in titanic-missingdata, where the "age" node is not observed. As far as querying the network goes, there is no difference between the two cases. In the first, you want

P(survival | age, gender, class, embark, name) 

and in the second, you want to marginalize over the unknown variable, age, to get

P(survival | gender, class, embark, name) = \sum_{age}P(survival, age | gender, class, embark, name)

Both these operations can be easily performed using pomegranate.
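As a rough sketch of both queries, using pomegranate's classic 0.x API (this is a cut-down two-parent network with made-up numbers, not the full Titanic model):

```python
# Minimal sketch with pomegranate's classic (0.x) API; the CPT values are
# illustrative, not estimated from the Titanic data.
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

gender = DiscreteDistribution({'male': 0.65, 'female': 0.35})
age = DiscreteDistribution({'child': 0.1, 'adult': 0.9})
survival = ConditionalProbabilityTable(
    [['male',   'child', 'yes', 0.45], ['male',   'child', 'no', 0.55],
     ['male',   'adult', 'yes', 0.20], ['male',   'adult', 'no', 0.80],
     ['female', 'child', 'yes', 0.70], ['female', 'child', 'no', 0.30],
     ['female', 'adult', 'yes', 0.75], ['female', 'adult', 'no', 0.25]],
    [gender, age])

n_gender = Node(gender, name='gender')
n_age = Node(age, name='age')
n_survival = Node(survival, name='survival')

model = BayesianNetwork('titanic-sketch')
model.add_states(n_gender, n_age, n_survival)
model.add_edge(n_gender, n_survival)
model.add_edge(n_age, n_survival)
model.bake()

# Fully observed query: P(survival | gender, age).
print(model.predict_proba({'gender': 'male', 'age': 'adult'}))

# 'age' unobserved: pomegranate marginalizes over it automatically,
# returning P(survival | gender) alongside the posterior over 'age'.
print(model.predict_proba({'gender': 'male'}))
```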

Note that in the first case, all nodes in the Markov blanket of "survived" are known, and hence the "name" node is irrelevant. In the second case, one node within the Markov blanket, "age", is unknown, but the value of its child node, "name", is known. Hence, we can use this information when marginalizing over the unknown node in the Markov blanket. This is the utility of the "name" node, from a theoretical point of view.

The question remains of estimating the numbers in the conditional probability tables. I would just use the values that are available in the data to learn the conditional probabilities. Afterwards, you can query the network for probable values of the missing data. For instance, what is

P(age | gender, class, embark, name, survival) 

Again, as far as network queries are concerned, there is really no difference between an unobserved node (hence, unknown) and missing data (hence, unknown).
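Continuing the sketch above, imputation is the same `predict_proba` call: clamp everything that is known, including the outcome, and read off the posterior over the missing node.

```python
# With 'age' missing, clamp the known nodes; the returned beliefs include
# the posterior distribution over the unobserved 'age' node.
beliefs = model.predict_proba({'gender': 'female', 'survival': 'yes'})
for state, belief in zip(model.states, beliefs):
    print(state.name, belief)
```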

For more theory, have a look at this link from Bayesia.

I think, with the method above, you can substantially improve your Kaggle score.

yeskarthik commented 9 years ago

Thanks for the description :+1:

I have coded up the version with embarked as another node:

  1. My Kaggle score went down from 0.76 to 0.74.

Then I added a name node with age and sex as dependencies, as described above, with 5 categories for name (Mr., Mrs., Miss., Master., Others) - so a conditional probability table of 5 (name) x 5 (age) x 2 (gender) = 50 rows:

  1. The Kaggle score was almost the same, at 0.74.
  2. For imputation I've just used the median (for training).

Should we conclude that the name node in the above graph doesn't make a big difference?

https://github.com/yeskarthik/pylearn/blob/master/scripts/titanic-notebook.ipynb

ronojoy commented 9 years ago

@yeskarthik, I do not think the name node is irrelevant, but it does need to be dealt with in a principled way in the face of missing data. The weakest points of the code you currently have are

The first of these is easily resolved: you need to have as many states in the age variable as there are distinct ages in the data set. The second and third points will involve more work if they are to be done carefully. Please have a look at this paper

Learning Tree Augmented Naive Bayes Classifier from incomplete datasets

for a method for parameter learning in the presence of incomplete data. Section 2.3.3 of the paper contains the material relevant to this problem.

In this problem, we first need to figure out whether the missing data is MCAR (missing completely at random) or MAR (missing at random); see the above paper for an explanation. Then we can move forward with one of the parameter learning algorithms that are appropriate for this situation.

There must be a way to automate all the grunt work that you are doing now in estimating the conditional probabilities - can you look around to see which graphical model / Bayesian network packages learn parameters from data and can be called from Python?
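For instance, pgmpy is one candidate: it can fit conditional probability tables directly from a pandas DataFrame. A sketch with made-up records (the class was named `BayesianModel` in older releases):

```python
# Illustrative sketch with pgmpy; the data and structure are made up.
import pandas as pd
from pgmpy.models import BayesianNetwork   # BayesianModel in older releases
from pgmpy.estimators import BayesianEstimator

data = pd.DataFrame({
    'gender':   ['male', 'female', 'female', 'male', 'male', 'female'],
    'age':      ['adult', 'adult', 'child', 'adult', 'child', 'adult'],
    'survived': ['no', 'yes', 'yes', 'no', 'yes', 'yes'],
})

model = BayesianNetwork([('gender', 'survived'), ('age', 'survived')])
# A BDeu prior smooths the estimated CPTs (cf. the alphas discussed below).
model.fit(data, estimator=BayesianEstimator, prior_type='BDeu',
          equivalent_sample_size=5)
print(model.get_cpds('survived'))
```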

fareez-ahamed commented 9 years ago

I have a feeling that having

> as many states in the age variable as there are distinct ages in the data set

will not improve the score, since all the ages will then be treated as separate discrete values.

Say we have age and survived data as follows

| Age | Survived (Count) |
|-----|------------------|
| 41  | 5 |
| 42  | 2 |
| 43  | 0 |
| 44  | 4 |
| 45  | 8 |

It is clear from the data that ages 41-45 have a good probability of surviving. But if we bin on each distinct age, then 43 will be assigned a low survival probability.

With discrete variables, 42 is no more similar to 43 than it is to 28, though in reality we know 43 is closer to 42 than 28 is. So I think binning on distinct values is not a good idea; binning on a reasonable range, like width 5 in the case of age, is better I feel.
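For example, range binning like this is a one-liner with pandas (the ages here are made up):

```python
import pandas as pd

# Illustrative ages; bin into width-5 intervals [0, 5), [5, 10), ..., [75, 80).
ages = pd.Series([25, 26, 27, 28, 29, 41, 42, 43, 44, 45])
age_bins = pd.cut(ages, bins=range(0, 81, 5), right=False)
print(age_bins.value_counts().sort_index())
```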

Correct me if I'm wrong.

ronojoy commented 9 years ago

@fareez-ahamed, I did not fully get

> With discrete variables, 42 is no more similar to 43 than it is to 28, though in reality we know 43 is closer to 42 than 28 is. So I think binning on distinct values is not a good idea; binning on a reasonable range, like width 5 in the case of age, is better I feel.

What I am suggesting is that the age node take on values 1, 2, 3, ... a_max, where a_max is the maximum age reported in the data. Now, this will make the data more sparse and smoothing will acquire even more relevance. As you can see, for age = 43, the count is zero. As we have discussed, this does not mean that anyone with age = 43 in the test set should perish with probability 1. I feel the two interesting features here are

The missing data means that we will have to do better than just sampling from a multinomial distribution. Let me decode the paper mentioned above and I'll post some more comments.

fareez-ahamed commented 9 years ago

What I was trying to express is: say we have the following data,

| Age | Survived (Count) |
|-----|------------------|
| ... | ... |
| 25  | 0 |
| 26  | 1 |
| 27  | 0 |
| 28  | 0 |
| 29  | 1 |
| ... | ... |
| 41  | 5 |
| 42  | 2 |
| 43  | 0 |
| 44  | 4 |
| 45  | 8 |
| ... | ... |

When we bin on each distinct age available in the data (as shown in the table above), P(survival | age=28) = P(survival | age=43). But from the data, we have an intuitive feeling that P(survival | age=43) > P(survival | age=28).

When we have a very large dataset, this probably doesn't matter, but when survivors are a very small percentage of the total records, I feel that binning on distinct ages might reduce the accuracy.

Whereas P(survival | age = 25 to 30) < P(survival | age = 41 to 45) seems to be a better option...

ronojoy commented 9 years ago

@fareez-ahamed, absolutely correct, and this is where the role of the prior comes in. The mathematical expression of

> an intuitive feeling that P(survival | age=43) > P(survival | age=28)

should go into the prior probability. Recall that we will learn the probabilities by smoothing, so that

P(age = 28 | survival) = (n_28 + alpha_28) / (N + \sum_k alpha_k)

where n_28 is the count for age 28, N is the total count, and the alpha_k are the Dirichlet pseudo-counts.

Clearly, when the counts are equal, say n_28 = n_43, you have to impose different alphas. This will result in different probability assignments, even though the counts are the same.
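A concrete sketch (counts and alphas made up) of how different pseudo-counts separate equal observed counts:

```python
# Additive (Dirichlet) smoothing with per-state pseudo-counts alpha_k:
# P(k) = (n_k + alpha_k) / (N + sum_k alpha_k).
def smoothed_probs(counts, alphas):
    N = sum(counts.values())
    A = sum(alphas.values())
    return {k: (counts[k] + alphas[k]) / (N + A) for k in counts}

counts = {28: 0, 43: 0, 42: 2}         # equal (zero) counts for ages 28 and 43
alphas = {28: 0.5, 43: 2.0, 42: 1.0}   # prior favours 43 over 28
print(smoothed_probs(counts, alphas))  # P(43) > P(28) despite equal counts
```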

fareez-ahamed commented 9 years ago

Got it :+1: !!! I should read more about prior selection and different smoothing methods :smiley:

When I was trying to figure out better methods than binning to handle age, I was looking for ways to use continuous distributions (later I found out that which continuous distribution to fit age with is itself a big debate). Korb & Nicholson suggest always converting continuous variables to discrete ones, as that is easier to calculate.

So the prior used in smoothing matters a lot too!!

Thanks @ronojoy for the explanation :smile:

ronojoy commented 9 years ago

Ok - good! For an accessible introduction to smoothing for multinomial sampling, have a look at

Bayesian Networks for Data Mining by David Heckerman.

Looking forward to improved scores on Kaggle! :)