uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
74 stars 14 forks source link

est_to_boo does not preserve the SNP block names (number of SNPs), but est_to_loo() does. Is that intended? #59

Open diegovelizo opened 10 months ago

diegovelizo commented 10 months ago

Hello,

I've been using Admixtools2 (admixtools_2.0.0) and have noticed that the function est_to_boo()does not preserve the dimension names, and instead replaces them for the generic string "l1". However, the function f2()pulls the number of SNPs (block_lengths) from the f2_blocks object, resulting in all SNP blocks being weighted the same. On the other hand, the function est_to_loo()does preserve the dimension names.

I've read the documentation and noticed that the lecture attached in the manual (https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/lecture2.pdf) explains the weighted block Jacknife procedure but does not mention any similar weighting for the the case of bootstrap.

Is this difference in the weighting procedure intended?

Thank you, Diego

uqrmaie1 commented 8 months ago

Good catch! That is in fact intended, but I admit that the code could be clearer.

In both jackknife and bootstrap, there are two steps:

  1. Calculate the statistic of interest for re-sampled subsets of the data
  2. Use these subset statistics to estimate the standard error

In jackknife the weights are needed in steps 1 and 2, but in bootstrap they are only needed in step 1. The functions est_to_loo() and est_to_boo() perform step 1. The weights are all set to 1 in est_to_boo(), because the bootstrap functions for step 2 re-use the jackknife functions for step 2. (This is unnecessarily complicated because the bootstrap SE estimate is just the standard deviation of the distribution produced in step 1.)

In est_to_boo(), the weights (the number of SNPs per block) are used in the following way: During each resampling, the blocks are resampled with replacement. The statistic for that subset is then calculated as the weighted mean of the statistics of the blocks that make up that subset (instead of just the regular mean).

There are other implementations of blocked weighted bootstrap that differ slightly from this. For example, in the R package bbw the weights influence the probability that a block is resampled, and observations are also resampled within each block.

But I'm not too concerned about the Admixtools 2 SE estimates, because in simulations, both the jackknife and the bootstrap estimates have very little or no bias in all realistic scenarios that I have tested (and they are usually pretty similar to each other).

Happy to give you a lot more details on all this if you're interested, and sorry for the late reply!

diegovelizo commented 8 months ago

Hi Robert,

Thank you for the detailed explanation. There are a few points that are still not totally clear to me, but I will go read the code again with your explanation in mind and will come back to you.

Best, Diego