rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Can widely varied depth samples run together? #4

Open TrilbyWhite opened 7 years ago

TrilbyWhite commented 7 years ago

First, my compliments on STITCH: I'm frequently critical of bioinformatics software, but STITCH seems very well polished.

I am, however, just getting started using it. My immediate goal is to impute genotype calls for 11 dog samples sequenced at 3x WGS. I'm doubtful that 11 * 3x would be sufficient read data, but I can run these together with BAMs of ~90 other dog samples I have on hand. Those other samples are all 30x depth. STITCH seems to readily take samples of varied read depth as input, but I'd like to know whether a wide range could lead to any problems. At one level, I suspect the more read data the better - but I'd also be concerned as to whether any of the calculations would be biased by the over-representation of the 30x samples.

Specifically, could having many 30x samples in with the 3x samples drive up confidence in variants found in the 30x samples in such a way that it could bias calls in the 3x samples? Or could the varied read depth lead to an over-representation of the 30x variants in the reconstructed ancestral haplotypes (if that even makes sense)?

I can downsample the 30x samples and test this empirically, but it might take a few permutations of downsampling and rerunning STITCH to directly test the effect of varied depth samples. Before I go through that I thought I'd check whether varied read-depth samples would already be known either to be safe or to be ill-advised.

Thanks,

Jesse

rwdavies commented 7 years ago

Thanks! I borrowed user interface ideas I've seen elsewhere, and added both unit and acceptance tests so that I can fix bugs and implement larger changes with confidence that I'm not breaking existing functionality.

I have not formally assessed the impact of order-of-magnitude differences in coverage on downstream imputation accuracy. In general, it should probably still work fine. However, the model does effectively assume uniform sample coverage, because each read contributes independently to the likelihood. This could lead to problems if coverage levels were unequally distributed between different breeds / subspecies / distantly related individuals: the ancestral haplotypes might end up looking more like the group with higher coverage than the group with lower coverage, especially if only a moderate number of ancestral haplotypes are available. Again, it is unclear how much of a problem this would be in practice, especially if the higher coverage samples were distributed across a broad range of populations.
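To put rough numbers on the concern above, here is a minimal arithmetic sketch of how the read data splits between the two groups in this thread, assuming each read carries equal weight and that read count scales with depth. The function name `read_share` and the group labels are made up for illustration; this is not part of STITCH.

```python
def read_share(groups):
    """Return each group's fraction of the total sequencing depth.

    groups maps a label to (number_of_samples, per_sample_depth).
    Assumption: reads count equally, so depth is a proxy for read count.
    """
    totals = {name: n * depth for name, (n, depth) in groups.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


# Numbers from this thread: 11 samples at 3x and roughly 90 samples at 30x.
shares = read_share({"low_3x": (11, 3.0), "high_30x": (90, 30.0)})
for name, share in shares.items():
    print(f"{name}: {share:.1%} of reads")
```

Under these assumptions the 11 low-coverage samples supply only about 1.2% of the reads, so the ancestral haplotypes would be driven almost entirely by the 30x samples.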

One last thing: I also haven't thoroughly tested running so few samples (101) at a time through STITCH. In general, I've focused on running hundreds or thousands, although again that's because I usually deal with much lower per-sample coverage. I would hope that you are able to get good results with 100 samples, especially as you have a lot of high coverage samples.

TrilbyWhite commented 7 years ago

Thanks. The 30x coverage samples are pretty diverse, so hopefully that will help. I will run a few tests of downsampling the 30x samples and post any results here (it may take some time).
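A minimal sketch of how the downsampling step could be scripted, assuming `samtools` is used and that the current depth of each BAM is known. It only builds the command line; the file names and the helper `subsample_command` are hypothetical. The `-s` option of `samtools view` takes a value of the form SEED.FRACTION, where the integer part seeds the sampler and the decimal part is the fraction of reads to keep.

```python
def subsample_command(in_bam, out_bam, current_depth, target_depth, seed=42):
    """Build a samtools command that keeps target_depth/current_depth of the reads."""
    frac = target_depth / current_depth
    if not 0.0001 <= frac <= 0.9999:
        raise ValueError("target depth must be below current depth (fraction in 0.0001..0.9999)")
    # Keep four decimal places and splice them after the seed: e.g. seed 42, 0.1 -> "42.1000"
    frac_digits = f"{frac:.4f}".split(".")[1]
    return ["samtools", "view", "-b", "-s", f"{seed}.{frac_digits}", "-o", out_bam, in_bam]


# Hypothetical file names: take a 30x sample down to roughly 3x.
cmd = subsample_command("dog01.30x.bam", "dog01.3x.bam", current_depth=30, target_depth=3)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Using a different seed per permutation gives independent subsamples of the same BAM.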

I was planning on running at least one trial with the 30x downsampled to 3x in order to assess concordance between the STITCH calls and calls from our WGS variant calling pipeline. I should be able to use that same run to assess whether and to what degree calls in our 11 3x samples change when run alongside 30x vs just other 3x samples.
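One simple way to score the concordance described above, as a sketch only: it assumes genotypes have already been extracted from both call sets into dictionaries keyed by (chromosome, position), coded as alternate-allele counts 0, 1, or 2, with `None` for missing calls. The function name and data layout are illustrative assumptions, not the output format of STITCH or of any particular variant caller.

```python
def genotype_concordance(truth, imputed):
    """Fraction of sites called in both sets where the genotypes agree."""
    shared = [
        site for site in truth
        if truth[site] is not None and imputed.get(site) is not None
    ]
    if not shared:
        return float("nan")  # nothing to compare
    matches = sum(truth[site] == imputed[site] for site in shared)
    return matches / len(shared)


# Toy data: four sites are comparable, three of them agree.
truth = {("chr1", 100): 0, ("chr1", 200): 1, ("chr1", 300): 2, ("chr1", 400): 1, ("chr1", 500): None}
imputed = {("chr1", 100): 0, ("chr1", 200): 1, ("chr1", 300): 1, ("chr1", 400): 1, ("chr1", 500): 2}
print(genotype_concordance(truth, imputed))
```

Running the same function on the 11 3x samples from two STITCH runs (alongside 30x samples versus alongside downsampled samples) would give a direct measure of how much their calls change.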

As for the low sample number, this batch of 11 is a pilot run: assuming we can impute from the low coverage genomes, we will be adding many more low coverage samples.

rwdavies commented 7 years ago

Sounds like a good plan, hope it all goes well. And glad to hear you'll add a lot more samples, that will certainly improve performance. It might also be interesting to see whether you can go lower than 3x and still achieve performance you find satisfactory, if that would enable you to sequence more samples.