rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
73 stars 19 forks source link

--genfile its value and generating the file #67

Open GuyReeves opened 2 years ago

GuyReeves commented 2 years ago

Hi

I am thinking about trying the option --genfile. As of my 7500 samples 50 of them have a sequencing coverage of >x15.

I was thinking of taking the file from --outputInputInVCFFormat = TRUE. and using the data from only the 50 high coverage samples to make the required file ( tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA).

Does that sound like a plan that might work? The number of reads removed during the "generate inputs" look pretty modest for all samples.

Do you think that --genfile might be an option work trying? I have a large number of trios and my Mendelian error rate is very low; I was jus curious if it might be further improved - if it is easy for me to have a go--. Thanks Guy

rwdavies commented 2 years ago

This would work, though perhaps it's cleanest from a software evaluation perspective to use external software. I normally use the good old fashioned GATK 3 UnifiedGenotyper (now many years old!), as it is fast, and I can tell it to genotype sites given specific reference and alt alleles. I assume HaplotypeCaller can do the same thing, but am not entirely sure. I think samtools/bcftools can do the same thing. But this should work as well (though to be clear on phrasing: are you suggesting (2 = hom alt) (NA = missing) or (2 = (hom alt or missing)), I assume the former, thought phrasing unclear

GuyReeves commented 2 years ago

Hi

Yes it totally makes sense to use a real genotype caller rather than (mistakenly) rely on sufficient coverage at ever genotype. Particularly, as the --outputInputInVCFFormat does not really have anything to filter out weak sites

FORMAT=

FORMAT=

FORMAT=

I actually have individual .vcf files, so I just need to merge them, I was trying to be too lazy.

As far as phrasing, I took tthe text from the STITCH help, I understand it as 2 = hom alt and NA i= indicating missing genotype (pretty close to the "vcftools --012" option , but where "-1" needs to be replaced by "NA") Thanks

Guy

genfile
Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the bam file. Each subject is then a tab seperated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore this file has one more row than posfile which has no header

rwdavies commented 2 years ago

I mean, STITCH should do what GATK3 UG does, I think exactly, though I'm probably missing some things. And yeah, STITCH won't filter sites, so if you want annotations to do that, you'll need something else.

OK cool re: NA, sounds good