K optimization: r2 vs score

nbedelman commented 3 years ago

Hello,

I am running STITCH on a relatively small sample (40 individuals) with ~0.5X per-sample haplotagging (linked-read) sequencing. I recognize that this sample size is pretty small for imputation, but I'm hoping STITCH will still be at least somewhat effective. I've run the program varying K and the number of generations as suggested, and am now evaluating the output. The mean score and the number of sites with scores > 0.4 both increase as K increases from 2 to 35, where the values seem to asymptote. However, r2 peaks around K=14 (r2=0.875) and drops off on either side (e.g. K=35, r2=0.73). The number of generations has minimal effect on r2, but runs with fewer generations (10-100) consistently yield more sites with high scores than runs with more generations (300-1000). Does this make sense, and would you recommend maximizing score values or r2 values when selecting K?

Thanks! Nate

rwdavies commented 3 years ago

Hi Nate,

I'm honestly shocked that such large values of K perform well with N=40 samples at 0.5X (~20X total coverage).

But anyway, in general I would go with external r2 as the best measure, which here points to K=14, as you write. For the nGen parameter, if as you say it has limited effect on r2 but some effect on the INFO score, I would go with the option that maximizes the INFO score anyway, just in case.

Best, Robbie
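
The selection rule described above (rank by external r2 first, then use the INFO score to break near-ties) can be sketched in a few lines of Python. This is only an illustration and not part of STITCH: the `Run` record, the `pick_run` helper, and the `r2_tol` threshold are hypothetical. Only the two r2 values for K=14 and K=35 come from this thread; the nGen assignments, all `mean_info` values, and the K=2 row are made-up placeholders.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One STITCH run summarised by its parameters and two quality metrics."""
    K: int
    nGen: int
    r2: float         # external r2 measured for this run
    mean_info: float  # mean INFO score across sites for this run


def pick_run(runs, r2_tol=0.01):
    """Pick the run with the best r2, using mean INFO to break near-ties.

    r2_tol is an assumed threshold for what counts as a "limited effect"
    on r2; it is not a STITCH parameter.
    """
    best_r2 = max(r.r2 for r in runs)
    # Keep every run whose r2 is within r2_tol of the best one.
    near_best = [r for r in runs if best_r2 - r.r2 <= r2_tol]
    # Among those, prefer the highest mean INFO score, then the highest r2.
    return max(near_best, key=lambda r: (r.mean_info, r.r2))


# Placeholder numbers, except r2=0.875 (K=14) and r2=0.73 (K=35).
runs = [
    Run(K=2, nGen=10, r2=0.60, mean_info=0.30),
    Run(K=14, nGen=10, r2=0.875, mean_info=0.62),
    Run(K=14, nGen=1000, r2=0.872, mean_info=0.55),
    Run(K=35, nGen=10, r2=0.73, mean_info=0.80),
]

best = pick_run(runs)
print(best.K, best.nGen)
```

With these placeholder values, the K=35 run is excluded despite having the highest INFO score because its r2 is far from the best, and the two K=14 runs are separated by INFO score alone.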

nbedelman commented 3 years ago

perfect, thank you!

rwdavies/STITCH issue #59