no vcf file provided - SNPs with called genotype are imputed

SelinaKlees commented 4 months ago

Hello, I have WGS data, including the BAM files and the called SNPs as a VCF. I used STITCH to impute my SNPs based on the BAM files and the SNP positions. However, I realized that because I am not providing the VCF, STITCH also imputes genotypes that are not actually missing in the VCF file. As a result, many genotypes differ between the called SNPs and the STITCH output. I only want to impute the genotypes that are missing in my VCF. Is there a way to do that with STITCH? Best, Selina

rwdavies commented 4 months ago

Hi,

What's the depth of coverage? (And how many samples?) If the depth of coverage is high, you might not want to use STITCH. If it is low, you might not want to trust your called genotypes from the WGS directly.

STITCH does not have a way to simply impute some missing genotypes. If this is something like RADseq or GBS or similar, my suggestion would be to run everything through STITCH, then make a merged set with SNPs and genotypes from the RADseq or GBS, and then, for SNPs not meeting a certain QC filter in the RADseq or GBS, use the imputed ones.

Thanks, Robbie
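The merge suggested above could be sketched roughly as follows. This is only an illustration, not something STITCH provides: the function names, the dictionary layout, and the GQ cutoff of 20 are all assumptions, and reading the genotypes out of the two VCFs is left out.

```python
from typing import Dict, Optional, Tuple

def merge_genotype(called_gt: Optional[str], called_gq: Optional[int],
                   imputed_gt: str, min_gq: int = 20) -> Tuple[str, str]:
    """Keep the called genotype if it is present and passes QC, else use the imputed one.

    min_gq is a hypothetical QC cutoff; choose one that suits the data.
    """
    is_missing = called_gt is None or "." in called_gt
    if not is_missing and called_gq is not None and called_gq >= min_gq:
        return called_gt, "called"
    return imputed_gt, "imputed"

def merge_site(called: Dict[str, Tuple[Optional[str], Optional[int]]],
               imputed: Dict[str, str],
               min_gq: int = 20) -> Dict[str, Tuple[str, str]]:
    """Merge one SNP across samples.

    called maps sample -> (GT, GQ); imputed maps sample -> GT.
    Samples absent from `called` are treated as missing.
    """
    merged = {}
    for sample, imputed_gt in imputed.items():
        called_gt, called_gq = called.get(sample, (None, None))
        merged[sample] = merge_genotype(called_gt, called_gq, imputed_gt, min_gq)
    return merged

# Toy example for a single SNP and three samples.
called = {"s1": ("0/1", 35), "s2": ("./.", None), "s3": ("1/1", 5)}
imputed = {"s1": "0/0", "s2": "0/1", "s3": "0/1"}
merged = merge_site(called, imputed)
```

Recording the source next to each genotype makes it easy to check afterwards how much of the merged set is actually imputed.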

SelinaKlees commented 3 months ago

Hi Robbie, thanks for the reply! I have WGS data for 96 samples at different coverages. Originally, we have ~22x, but then we downsampled the reads to 8x, 4x, 2x, 1x, and 0.5x coverage. We wanted to compare these datasets to answer the question "how low can we go?". So I used STITCH for each of the six coverages. So does this mean STITCH is better seen as a new variant caller for low-coverage sequencing data than as a tool for imputing missing genotypes in an existing VCF file? Best, Selina
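For reference, the downsampling described here comes down to keeping a fraction of the reads equal to target depth divided by original depth. The sketch below just works out those fractions; it assumes the original depth is exactly 22x (the thread only says ~22x), and the function name is made up for illustration.

```python
def subsample_fraction(original_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep to go from original_depth to target_depth."""
    if original_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    if target_depth > original_depth:
        raise ValueError("cannot downsample to a higher depth")
    return target_depth / original_depth

ORIGINAL_DEPTH = 22.0  # assumption: the post gives ~22x
TARGET_DEPTHS = [8, 4, 2, 1, 0.5]

fractions = {t: subsample_fraction(ORIGINAL_DEPTH, t) for t in TARGET_DEPTHS}
for depth, frac in fractions.items():
    print(f"{depth}x: keep {frac:.4f} of reads")
```

A fraction like this is the kind of value a read subsampler such as `samtools view -s` takes, though per-sample depths will vary, so computing it per sample is probably safer than using one global value.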

rwdavies commented 3 months ago

If you have data at high coverage (>10X), you probably don't need to impute, provided you can tolerate a moderate missing data rate from filtering out genotypes with low GQ (say below 10 or 20).
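As a rough sketch of that filtering step: genotypes below a GQ cutoff are set to missing, and the resulting missing rate is reported so it can be weighed against what is tolerable. The helper name and the tuple layout are assumptions for illustration; the cutoff of 20 is one of the values mentioned above.

```python
from typing import List, Optional, Tuple

def mask_low_gq(calls: List[Tuple[str, Optional[int]]],
                min_gq: int = 20) -> Tuple[List[str], float]:
    """Set genotypes with GQ below min_gq (or without a GQ) to missing.

    calls is a list of (GT, GQ) pairs. Returns the masked genotypes
    and the fraction that ended up missing.
    """
    masked = [gt if gq is not None and gq >= min_gq else "./." for gt, gq in calls]
    missing_rate = masked.count("./.") / len(masked) if masked else 0.0
    return masked, missing_rate

# Toy example: four genotype calls at one site.
calls = [("0/1", 45), ("0/0", 8), ("1/1", 30), ("0/1", None)]
masked, missing_rate = mask_low_gq(calls, min_gq=20)
```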

I would say that in its primary purpose, STITCH is neither a variant caller nor designed for imputation of individual missing genotypes. It's designed for quite low coverages (<2X), where individual genotyping of variants in samples is impossible. It also doesn't do variant calling per se, though it can help better determine which variants are likely true positives: those that agree with their imputed background (have a high INFO score).
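One way to act on the INFO score is to filter sites on it after imputation. The sketch below parses the INFO column of a VCF record as a plain string; it assumes the score is stored under the key `INFO_SCORE`, and the 0.4 cutoff is purely a placeholder, not a recommendation from this thread.

```python
from typing import Optional

def info_score(info_field: str, key: str = "INFO_SCORE") -> Optional[float]:
    """Extract a numeric value from a VCF INFO string such as 'EAF=0.2;INFO_SCORE=0.9'.

    The key name is an assumption; check the header of the actual output VCF.
    """
    for entry in info_field.split(";"):
        name, _, value = entry.partition("=")
        if name == key and value:
            return float(value)
    return None

def keep_site(info_field: str, min_info: float = 0.4) -> bool:
    """Keep a site only if its INFO score is present and at least min_info."""
    score = info_score(info_field)
    return score is not None and score >= min_info

# Toy INFO strings, not real STITCH output.
examples = ["EAF=0.21;INFO_SCORE=0.93;HWE=0.5", "EAF=0.02;INFO_SCORE=0.12", "EAF=0.10"]
kept = [keep_site(s) for s in examples]
```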

Hope that helps. One last comment: 96 samples is good, but at 0.5X you might see much better accuracy if you imputed many more samples (e.g. 1000 samples). So I would take any results you get at the lower coverages as advisory rather than definitive, if that makes sense (i.e. assume things might get better with more samples). See the STITCH paper; we have a figure about this.

SelinaKlees commented 3 months ago

Thank you for the comment and advice! Yes, the low sample size will definitely be a discussion point in the manuscript. Best, Selina

rwdavies / STITCH #97