shuowencs / SortedEffects

Source code for R package SortedEffects
Other
5 stars 0 forks source link

Computing join p-values manually #1

Open federiconuta opened 2 years ago

federiconuta commented 2 years ago

Hi and thank you for your work on sorted effects.

Say I have a database having 2 variables, Coeff and Variable being Coeff the B CADiffs f several variables, say k (so that, overall the data contain k*B variables). I would liike to construct, starting from here, the joint p-values with your algorithm. By studying the theoretical paper, I ended up with the following steps of which I am not fully sure (since the results I obtained are not as expected):

1) Attach to the Bk dataset, the original coefficients (which have a unique value for each variable), say coeff_orig. 2) Compute the difference between bootstrap coefficient (Coeff) and coeff_orig. This is Z_c. 3) Construct sig for each variable as being the difference between the 75 and 25 quantile of Z_c (divided by the difference between the 75 and 25 quantiles of the standard normal) 4) generate t_tilde = abs(Z_c)/sig 5) generrate stat = abs(H_c)/sig where H_c = 2Coeff_orig-draws_H_mean and draws_H_mean = mean(Coeff), by(Vaariable) 6) count how many times, on average (on the number of bootstraps) and by variable, t_tilde>stat

Are these steps correct? If not, can you ,please, provide me further hints on how to compute the joint p-values starting from the mentioned dataset?

Thank you

shuowencs commented 2 years ago

@federiconuta Hi, thanks for reaching out. I'm afraid I don't fully understand your first sentence. May I ask what you mean by " Coeff and Variable being Coeff the B CADiffs f several variables, say k (so that, overall the data contain k*B variables)"?

federiconuta commented 2 years ago

@shuowencs thank you for the reply and sorry for the unclear comment.

" Coeff and Variable being Coeff the B CADiffs of several variables, say k (so that, overall the data contain k*B variables)"

Say B is the number of bootstraps made, 100 for instance, my data contain a variable Coeff. (Coeff_boot from now on) being the CADiff for each bootstrap iteration for Variable k. So, since I have k variables and 100 boot, the overall number of observations turns out to be k*100. In the example below I show a case in which B=100 and k=3 (x1,x2,x3).

I try t be clearer by making an example. So my starting database looks as follows (say I have made 100 bootstrap iterations for each variable x1, x2, x3):

Bootstrap           Variables.   Coeff_boot.    Coeff_original
1                   x1               0.03                  0.12
2                   x1               0.2                   0.12
3                   x1               0.3                   0.12
...                 ....                  ...                  ....
100                 x1              0.5                    0.54
1                   x2              0.46                   0.54
2                   x2              0.78                   0.54
3                   x2              0.90                   0.54
...                   ....               ....                      ....
100                 x2              0.55                   0.54
1                   x3              1.05                   1.4
2                    x3             0.87                   1.4
3                    x3             0.98                   1.4
...                  ...                      ....                ....
100                 100             1.67                   1.4

where Coeff_boot and Coeff_original represent (to my intuition) Lambda tilde and Lambda hat of the Sorted Effects paper (THE SORTED EFFECTS METHOD: DISCOVERING HETEROGENEOUS EFFECTS BEYOND THEIR AVERAGES), i.e. the differences in the mean of covariates. Now, my question is basically how to compute, starting from these data, the join p-values. To do that I followed these steps:

1) I generated Z_c as Coeff_boot- Coeff_orig; 2) I computed sig (variance matrix) as being, by variable, (centile(75)-centile(25))/(qnorm(0.75)-qnorm(0.25)); 3) I generated t_tilde as abs(Z_c)/sig 4) Maximized t_tilde by boot (so I have the maximum of t_tilde for boot 1, boot 2, and so on) 5) I generated stat as abs(Coeff_orig)/sig 6) Finally I computed the j-pvals as being the average (over boot) number of times that t_tilde_max>stat.

Is that procedure clear? Sorry to ask, but since I am working in Stata I would like to be sure that I am correctly reproducing the theoretical results.

Thank you

shuowencs commented 2 years ago

@federiconuta Hi Federico, sure no problem I'm willing to help out.

Your procedures look fine to me, but make sure you have the dimensions right in the code. For example, Z_c in your case is 100 3, while sig is 1 3. When you compute t_tilde, you need to do the division properly (expanding sig to 100*3) and take the max.

There are many reasons why your results may not be what you expected. Except for some coding issues (we can only hope everything we wrote is correct..), maybe try a larger number of bootstrap, say 500. Or consider alternative regression specifications.

Hope this helps. Please let me know if you have further questions. However, I'm fairly busy these days for interviews, so my response might be a bit slow.

federiconuta commented 2 years ago

@shuowencs thank you for the reply and for the help.

Your procedures look fine to me but make sure you have the dimensions right in the code. For example, Z_c in your case is 100 3, while sig is 1 3. When you compute t_tilde, you need to do the division properly (expanding sig to 100*3) and take the max.

The dimensions should be ok as Stata provides some commands (e.g., merge m:1; egen, by(group)...) that already expand the dimension to the desired size. Thank you for the advice.

There are many reasons why your results may not be what you expected. Except for some coding issues (we can only hope everything we wrote is correct..), maybe try a larger number of bootstrap, say 500. Or consider alternative regression specifications.

I checked the R source code you provided in Github and read both the theoretical paper and the R paper. You made a great job! The results I got seem overall good but since we employed join p-values on a lot of scenarios I still have to sum up everything.

Hope this helps. Please let me know if you have further questions. However, I'm fairly busy these days for interviews, so my response might be a bit slow.

Thank you a lot for your availability and help. I will let you know if any issues arise.