square / pysurvival

Open source package for Survival Analysis modeling
https://www.pysurvival.io/
Apache License 2.0
350 stars 106 forks

Tutorial - Employee Retention - Dropping low salary feature #1

Closed Olof-Hojvall closed 5 years ago

Olof-Hojvall commented 5 years ago

I don't fully understand how the salary feature is handled in the Employee Retention tutorial. It appears to be an ordinal feature with three categories: low, medium and high. The tutorial does the following:

  1. The salary feature is one-hot encoded - Why wouldn't an ordinal encoding work here, considering the tree model?

  2. The correlation between the "low" and "medium" columns is then tested and found to be strongly negative - Isn't this quite expected, considering both columns come from the same categorical feature?

  3. The "low" column is dropped - Doesn't that mean that we effectively grouped "high" and "low" salary together?
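The three points above can be reproduced on a toy column. This is a minimal sketch, not the tutorial's code: the data is made up, and it assumes the one-hot step uses `pd.get_dummies` with `drop_first=True`, which would explain why only "low" and "medium" columns are present.

```python
import pandas as pd

# Made-up stand-in for the tutorial's salary column (assumption, not real data).
df = pd.DataFrame(
    {"salary": ["low", "medium", "high", "low", "medium", "low", "high", "medium"]}
)

# One-hot encode. With drop_first=True, pandas drops the first level in sorted
# order ("high"), leaving only salary_low and salary_medium.
dummies = pd.get_dummies(df["salary"], prefix="salary", drop_first=True).astype(int)

# Dummy columns from the same feature are mutually exclusive, so their
# correlation is negative by construction.
corr = dummies["salary_low"].corr(dummies["salary_medium"])

# Dropping salary_low leaves a single indicator: "is the salary medium?".
reduced = dummies.drop(columns="salary_low")

# Map each original category to its remaining encoding.
encoded = (
    pd.concat([df, reduced], axis=1)
    .drop_duplicates()
    .set_index("salary")["salary_medium"]
    .to_dict()
)
print(corr)
print(encoded)
```

Under these assumptions, "low" and "high" both end up encoded as 0 in the remaining column, so the model can no longer tell them apart.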

steph-likes-git commented 5 years ago

Hi @Olof-Hojvall ,

That's a very good point; as we're using a tree, the ordinal encoding works.
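For illustration, one way to build such an ordinal encoding with pandas is sketched below. The column name `salary_ordinal` and the toy data are assumptions, not part of the tutorial.

```python
import pandas as pd

# Made-up stand-in for the tutorial's salary column (assumption, not real data).
df = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# State the order explicitly; alphabetical order would put "high" first.
order = ["low", "medium", "high"]

# Ordered categorical codes: low -> 0, medium -> 1, high -> 2.
df["salary_ordinal"] = pd.Categorical(
    df["salary"], categories=order, ordered=True
).codes
print(df)
```

A tree can then isolate any contiguous salary band with threshold splits on this single column, for example `salary_ordinal <= 0.5` to separate "low" from the rest.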

I first wrote this tutorial thinking that I was going to use the Neural MTLR model, which needs one-hot encoding. But then I realized that the Conditional Survival Forest model provided better results, so I used it instead, without changing the Exploratory Data Analysis part.

So feel free to use your approach and let me know what you find in terms of performance.

Thanks for your interest in the project and let me know if you consider the matter resolved.