rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

A question about the results of STITCH #49

Closed Tengjun0520 closed 1 year ago

Tengjun0520 commented 3 years ago

Hi, I have a question about the results of STITCH: how can I distinguish which SNPs are called and which are imputed?

rwdavies commented 3 years ago

Hi,

Is this in any particular context? The two terms can be used somewhat synonymously. That said, I would say "calling" is usually about determining whether a site is variable in a population (not done by STITCH) and/or determining genotype likelihoods (and posteriors) in individual samples using direct genotyping information (sequencing reads). "Imputing" is inferring genotype posterior probabilities (and things like dosages) across individual samples in a population, usually or often without direct evidence for the genotype at any particular site (with or without reference data, as in STITCH).

Best, Robbie

Tengjun0520 commented 3 years ago

Hi Robbie,

I may not have described it clearly. I recently used STITCH to analyze my low-coverage sequencing data. In our data, low-MAF SNPs also have higher accuracy. I want to distinguish which SNPs are genotyped from their own sequencing reads and which SNPs are inferred from other SNP information.

rwdavies commented 3 years ago

Hi,

First, re: low-MAF SNPs having higher accuracy. While that might be true, your measure of accuracy may not be measuring something useful in the way you think it is. For example, take a low-MAF SNP and a moderate-MAF SNP and scramble the sample ID labels, so that there is no relationship between true genotype and imputed genotype. The low-MAF SNP would still show higher accuracy (e.g. genotype concordance) than the moderate-MAF SNP, because at a low-MAF SNP most samples are homozygous reference in both the truth and the imputed calls and so agree by chance, but your results would be worthless. In this case r2 would be more valuable, and would be about 0 for both sites (as makes sense, since there would be no relationship).
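The scrambled-label argument above can be checked with a small simulation. This is only a sketch under stated assumptions: genotypes are drawn under Hardy-Weinberg proportions, "accuracy" is taken to mean genotype concordance, and the function names (`simulate_scrambled`, `concordance`, `r_squared`) are made up for illustration and are not part of STITCH.

```python
import random


def simulate_scrambled(maf, n_samples=2000, seed=1):
    """Return (truth, scrambled) genotype lists for one SNP.

    Assumption: genotypes (0, 1 or 2 alt alleles) follow Hardy-Weinberg
    proportions. "scrambled" is the truth with sample labels shuffled,
    standing in for imputed genotypes that carry no real information.
    """
    rng = random.Random(seed)
    truth = [int(rng.random() < maf) + int(rng.random() < maf)
             for _ in range(n_samples)]
    scrambled = truth[:]
    rng.shuffle(scrambled)
    return truth, scrambled


def concordance(a, b):
    """Fraction of samples where the two genotypes agree exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def r_squared(a, b):
    """Squared Pearson correlation between two genotype lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic in this sample: correlation undefined
    return cov * cov / (var_a * var_b)


if __name__ == "__main__":
    for maf in (0.02, 0.30):
        truth, scrambled = simulate_scrambled(maf)
        print(f"MAF={maf:.2f}  "
              f"concordance={concordance(truth, scrambled):.3f}  "
              f"r2={r_squared(truth, scrambled):.4f}")
```

With labels scrambled, concordance should stay high at the low-MAF site and drop at the moderate-MAF site, while r2 should be close to 0 at both.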

About the other point: all SNPs are inferred from other SNP information. STITCH (like most imputation programs) uses the sequencing reads to infer who a sample is copying from among a set of reference haplotypes (here, ancestral haplotypes built by the model). Depending on who is being copied from, this gives the genotype posterior probabilities (and hence things like dosage). Individual reads do have an effect at the SNP they intersect, but this effect is small unless the coverage is really high.
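As a toy illustration of how "who is being copied from" turns into genotype posteriors and a dosage, here is a deliberately simplified sketch. It is not STITCH's actual code or exact model: it assumes the copying probabilities for each of the sample's two haplotypes are already known and independent of each other, and the names `genotype_posteriors`, `copy_probs_hap1`, `copy_probs_hap2` and `alt_prob` are hypothetical.

```python
def genotype_posteriors(copy_probs_hap1, copy_probs_hap2, alt_prob):
    """Toy genotype posteriors at one SNP from haplotype-copying probabilities.

    copy_probs_hapX[k]: probability that the sample's haplotype X copies
                        ancestral haplotype k at this SNP (sums to 1).
    alt_prob[k]:        probability that ancestral haplotype k carries the
                        alternate allele at this SNP.
    Assumption: the two haplotypes are treated as independent, which is a
    simplification made for this illustration.
    """
    # Probability that each sample haplotype carries the alternate allele.
    p1 = sum(c * a for c, a in zip(copy_probs_hap1, alt_prob))
    p2 = sum(c * a for c, a in zip(copy_probs_hap2, alt_prob))
    posteriors = [
        (1 - p1) * (1 - p2),            # homozygous reference
        p1 * (1 - p2) + (1 - p1) * p2,  # heterozygous
        p1 * p2,                        # homozygous alternate
    ]
    # Dosage is the expected number of alternate alleles.
    dosage = posteriors[1] + 2 * posteriors[2]
    return posteriors, dosage


if __name__ == "__main__":
    # Three hypothetical ancestral haplotypes; only the first carries alt.
    alt_prob = [0.99, 0.01, 0.01]
    hap1 = [0.90, 0.05, 0.05]  # mostly copying ancestral haplotype 0
    hap2 = [0.02, 0.95, 0.03]  # mostly copying ancestral haplotype 1
    posteriors, dosage = genotype_posteriors(hap1, hap2, alt_prob)
    print("posteriors (hom ref, het, hom alt):", posteriors)
    print("dosage:", dosage)
```

In this sketch no read needs to cover the SNP itself: the posteriors come entirely from which ancestral haplotypes the sample appears to copy, and that is in turn inferred from reads at nearby SNPs.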

Best, Robbie

Tengjun0520 commented 3 years ago

Ok. Thanks, Robbie