Closed iqbal-lab closed 5 years ago
The output VCF from the infer command describes a single path through the PRG only. If quasimap coverage indicated that a reference allele should be selected in forming the path, then there would be no corresponding entry in the output VCF file. Non-reference alleles should have corresponding entries in the VCF.
We should discuss this - it should
I've pushed some modifications to dev branch which add sample column to the output vcf, producing the fields: 'DP:GT:COV:GT_CONF' made from parsing output of quasimap
and genotyping.
Also now there is a row for every site in the PRG, even if: i)The REF gets called ii)There is no genotype confidence (no coverage anywhere at variant site)
I also made two vcf output modes (specifiable at command line): i) 'Single' is sparse, it only outputs non-zero coverage alleles at genotyped sites. ii) 'Population' writes all alleles for all sites. I figured this would be useful when doing regenotyping of a set of samples, to easily combine them.
Option i) is useful for sites with v. long lists of ALTs. On TB simulated datasets it is >50 times less big than the 'Population' one.
Here is an example for 'Single':
##fileformat=VCF4.2
##source=gramtools, version 0.5.0, last commit f471a746037a7f8dc637474fa91ff925e0febf49
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total read depth from gramtools">
##FORMAT=<ID=COV,Number=R,Type=Integer,Description="Number of reads on ref and alt allele(s)">
##FORMAT=<ID=GT_CONF,Number=1,Type=Float,Description="Genotype confidence. Difference in log likelihood of most likely and next most likely genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT site.site_sim.iso.1.subject.simulated_reads.150.30.lab_id.simulated_reads.150.30.seq_reps.1
NC_000962.3 1000 . GTCTTAATACCCCTGGAATCGCCATCTTGTGCAAGATAGTACAGAGGAGC GAGCTGGAGACCCGCATCGCCATCTTGCGCAAGAAAGCACAGATGGA . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:26:0,26:1136.36
NC_000962.3 2005 . ACGGCGCTCGCGTGTTGAACAACTCATCGAGTGATGTGGAAGTGA AGCGGGGCGCTAGCCACGGGATCGAACTCATCGTGAGGTGAAAGGGC . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:29:0,26:828.16
NC_000962.3 3004 . GACGAGGGAGGTACACTCGTTGGGGACTAGGCCGGTGAGTGAAAGTGAC GCCGAGGAAGATCTTGTTGTTGACTATGCCGGTGAACCATTGACG . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:27:0,27:1120.67
NC_000962.3 4012 . CGAATATCATTCGGCAATCGGCAGAACCTCTATTGTCGGATTT CAGATATCGATCGGCAATTGTTAGCAGCTCGGCTGTTGGCGG . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:25:0,24:924.42
NC_000962.3 5005 . ATGTCGAAGTTGGGTCAGATCTAGACGCTCAGCTCGCCTAGAAAGCCTCGT ACGTCGATCGGCCCAGAACAAGGCGCTCCGGTCCCGGCCTGAGAGCCTCGA . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:19:0,19:1125.08
NC_000962.3 6008 . GTGCGCATTGTGGTGGTTACTTCGTGAAACACATAATACCTAGAAG GGTGGCCTGGTGGACTTCGTGAAACACATCAACCGCACCAAGAAC . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:27:0,26:1033.07
NC_000962.3 7006 . GATTAACGTCTGCTGGTGGCGGGCTTGCTGAAGGCCGGGAGGCAGTTCAT GCGACGGTCTGCTGGAGGCGGGGCTGAAGGCCGGGAAGAAGATCAACAAG . PASS SVTYPE=COMPLEX GT:DP:COV:GT_CONF 1/1:29:0,28:1001.57
this looks good!
we should organise a code review process
I've just run quasimap and infer on a single sample.
gramtools quasimap --gram-directory /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/gram_k11/ --reads SC_GMSM5876108_1.fastq.gz --reads SC_GMSM5876108_2.fastq.gz --output-directory quasimapk11/
took 40 hours on 1 CPU and 116Gb of RAM
gramtools infer --gram-directory /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/gram_k11/ --mean-depth 120 --error-rate 0.01 --quasimap-directory quasimapk11/ --output-vcf inferk11/inferred_k11.vcf
However the VCF contains only 6039 lines (not ~2 million) and has no sample column.
Looks like it terminated early, but with no error