psarka / uplift

BSD 3-Clause "New" or "Revised" License

debug help #2

Closed haiy closed 4 years ago

haiy commented 5 years ago

Hi, your uplift package is great! I'm trying to use it in my project. Rather than just using it directly, I would also like to know which paper your version is based on. Could you please give me a link? Also, since I'm new to Cython, could you give me some tips on how to debug it? Thanks so much~

psarka commented 5 years ago

I'm glad you are giving it a try, but please be careful: the code is an experimental proof of concept rather than something tried and tested!

The current implementation is based on the 2010 paper by P. Rzepakowski and S. Jaroszewicz, "Decision trees for uplift modeling".

As for debugging Cython, I won't be able to give you a decent tip. Check out https://cython.readthedocs.io/en/latest/src/userguide/debugging.html. Also, if you have a license, PyCharm Professional supports debugging Cython.

haiy commented 5 years ago

Thanks so much~ I will read the paper first.

haiy commented 5 years ago

Hi @psarka, I'm reading your Cython code alongside the R package from Leo Guelman, and I found a small difference in the UpliftEntroy criterion calculation. In the R version (lines 6-8), the Gini term is calculated as 2 * pr.ct1 * (1 - pr.ct1), but in your version (line 3) it is just the product p_t_l * p_t_r. I'm confused about which one is right, because I can't find the original formula. Could you please help me understand it? Thanks so much.

The R version is:

   ### Euclidean gain
1  eucli.node <- (pr.y1_ct1 - pr.y1_ct0) ^ 2 + ((1 - pr.y1_ct1) - (1 - pr.y1_ct0)) ^ 2
2  eucli.l <- (pr.y1_l.ct1 - pr.y1_l.ct0) ^ 2 + ((1 - pr.y1_l.ct1) - (1 - pr.y1_l.ct0)) ^ 2
3  eucli.r <- (pr.y1_r.ct1 - pr.y1_r.ct0) ^ 2 + ((1 - pr.y1_r.ct1) - (1 - pr.y1_r.ct0)) ^ 2
4  eucli.lr <- pr.l * eucli.l + pr.r * eucli.r
5  eucli.gain <- eucli.lr - eucli.node

   ### Euclidean Normalization factor
6  gini.ct <- 2 * pr.ct1 * (1 - pr.ct1)
7  eucli.ct <- (pr.l_ct1 - pr.l_ct0) ^ 2 + ((1 - pr.l_ct1) - (1 - pr.l_ct0)) ^ 2
8  gini.ct1 <- 2 * pr.l_ct1 * (1 - pr.l_ct1)
9  gini.ct0 <- 2 * pr.l_ct0 * (1 - pr.l_ct0)
10 eucli.norm <- gini.ct * eucli.ct + gini.ct1 * pr.ct1 + gini.ct0 * pr.ct0 + 0.5

Your version:

1  Gini = p_t * p_c
2  E = (p_t_l - p_c_l)**2 + (p_t_r - p_c_r)**2
3  Gini_t = p_t * p_t_l * p_t_r
4  Gini_c = p_c * p_c_l * p_c_r
5  J = Gini * E + Gini_t + Gini_c + 0.5
6  impurity_improvement += (E_gain / J)

psarka commented 5 years ago

Hi,

Sorry, I do not have time at the moment to see where the 2 comes from in the R version, but note that p_t_r = 1 - p_t_l, so the rest of the formula is the same.
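
The observation above can be checked numerically. The following is a minimal sketch, not code from either package: the function names `r_style_norm` and `cython_style_norm` are hypothetical, and it assumes that pr.ct1 corresponds to p_t, pr.l_ct1 to p_t_l, pr.l_ct0 to p_c_l, and that pr.ct0 = 1 - pr.ct1. Under those assumptions the two normalization factors differ only by a factor of 2 on every term except the constant 0.5. For reference, the usual Gini impurity of a two-class distribution is 1 - p^2 - (1 - p)^2 = 2p(1 - p), which is one plausible origin of the 2 in the R code.

```python
def r_style_norm(p_t, p_t_l, p_c_l):
    """Normalization factor following lines 6-10 of the quoted R snippet.

    Assumed mapping (not confirmed by either author):
    pr.ct1 -> p_t, pr.l_ct1 -> p_t_l, pr.l_ct0 -> p_c_l, pr.ct0 -> 1 - p_t.
    """
    p_c = 1.0 - p_t
    gini_ct = 2 * p_t * (1 - p_t)
    eucli_ct = (p_t_l - p_c_l) ** 2 + ((1 - p_t_l) - (1 - p_c_l)) ** 2
    gini_ct1 = 2 * p_t_l * (1 - p_t_l)
    gini_ct0 = 2 * p_c_l * (1 - p_c_l)
    return gini_ct * eucli_ct + gini_ct1 * p_t + gini_ct0 * p_c + 0.5


def cython_style_norm(p_t, p_t_l, p_c_l):
    """Normalization factor following lines 1-5 of the quoted Cython snippet."""
    p_c = 1.0 - p_t
    p_t_r = 1.0 - p_t_l  # right-branch share is the complement of the left one
    p_c_r = 1.0 - p_c_l
    gini = p_t * p_c
    e = (p_t_l - p_c_l) ** 2 + (p_t_r - p_c_r) ** 2
    gini_t = p_t * p_t_l * p_t_r
    gini_c = p_c * p_c_l * p_c_r
    return gini * e + gini_t + gini_c + 0.5


if __name__ == "__main__":
    # Arbitrary illustrative probabilities, not data from either package.
    for p_t, p_t_l, p_c_l in [(0.5, 0.3, 0.6), (0.2, 0.9, 0.1), (0.7, 0.5, 0.5)]:
        r = r_style_norm(p_t, p_t_l, p_c_l)
        c = cython_style_norm(p_t, p_t_l, p_c_l)
        # With the constant 0.5 removed, the R-style value is twice the other.
        print(p_t, p_t_l, p_c_l, r - 0.5, 2 * (c - 0.5))
```

Because the constant 0.5 is not doubled, the two normalizers are not simply proportional, so the normalized gains can differ between the two versions even though the structure of the formula matches.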

haiy commented 5 years ago

OK, thanks.