Open GuyReeves opened 2 years ago
This would work, though perhaps it's cleanest from a software evaluation perspective to use external software. I normally use the good old fashioned GATK 3 UnifiedGenotyper (now many years old!), as it is fast, and I can tell it to genotype sites given specific reference and alt alleles. I assume HaplotypeCaller can do the same thing, but am not entirely sure. I think samtools/bcftools can do the same thing. But this should work as well (though to be clear on phrasing: are you suggesting (2 = hom alt) (NA = missing) or (2 = (hom alt or missing)), I assume the former, thought phrasing unclear
Hi
Yes it totally makes sense to use a real genotype caller rather than (mistakenly) rely on sufficient coverage at ever genotype. Particularly, as the --outputInputInVCFFormat does not really have anything to filter out weak sites
I actually have individual .vcf files, so I just need to merge them, I was trying to be too lazy.
As far as phrasing, I took tthe text from the STITCH help, I understand it as 2 = hom alt and NA i= indicating missing genotype (pretty close to the "vcftools --012" option , but where "-1" needs to be replaced by "NA") Thanks
Guy
genfile
Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the bam file. Each subject is then a tab seperated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore this file has one more row than posfile which has no header
I mean, STITCH should do what GATK3 UG does, I think exactly, though I'm probably missing some things. And yeah, STITCH won't filter sites, so if you want annotations to do that, you'll need something else.
OK cool re: NA, sounds good
Hi
I am thinking about trying the option --genfile. As of my 7500 samples 50 of them have a sequencing coverage of >x15.
I was thinking of taking the file from --outputInputInVCFFormat = TRUE. and using the data from only the 50 high coverage samples to make the required file ( tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA).
Does that sound like a plan that might work? The number of reads removed during the "generate inputs" look pretty modest for all samples.
Do you think that --genfile might be an option work trying? I have a large number of trios and my Mendelian error rate is very low; I was jus curious if it might be further improved - if it is easy for me to have a go--. Thanks Guy