iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Missing sample column from infer output #136

Closed iqbal-lab closed 5 years ago

iqbal-lab commented 5 years ago

I've just run quasimap and infer on a single sample.

  1. Quasimap seems to work fine. I used this command:

gramtools quasimap --gram-directory /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/gram_k11/ --reads SC_GMSM5876108_1.fastq.gz --reads SC_GMSM5876108_2.fastq.gz --output-directory quasimapk11/

took 40 hours on 1 CPU and 116Gb of RAM

  1. infer takes 9 minutes and 1 Gb of RAM, and completes with no error

gramtools infer --gram-directory /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/gram_k11/ --mean-depth 120 --error-rate 0.01 --quasimap-directory quasimapk11/ --output-vcf inferk11/inferred_k11.vcf

However the VCF contains only 6039 lines (not ~2 million) and has no sample column.

Looks like it terminated early, but with no error

ffranr commented 5 years ago

The output VCF from the infer command describes a single path through the PRG only. If quasimap coverage indicated that a reference allele should be selected in forming the path, then there would be no corresponding entry in the output VCF file. Non-reference alleles should have corresponding entries in the VCF.

iqbal-lab commented 5 years ago

We should discuss this - it should

  1. Have a row for every site in the PRG
  2. Have evidence for its decision in each row In this case, it only has 6000 rows, not 2million (only 6000 non-ref calls is not credible), and no final column at all
bricoletc commented 5 years ago

I've pushed some modifications to dev branch which add sample column to the output vcf, producing the fields: 'DP:GT:COV:GT_CONF' made from parsing output of quasimap and genotyping.

Also now there is a row for every site in the PRG, even if: i)The REF gets called ii)There is no genotype confidence (no coverage anywhere at variant site)

I also made two vcf output modes (specifiable at command line): i) 'Single' is sparse, it only outputs non-zero coverage alleles at genotyped sites. ii) 'Population' writes all alleles for all sites. I figured this would be useful when doing regenotyping of a set of samples, to easily combine them.

Option i) is useful for sites with v. long lists of ALTs. On TB simulated datasets it is >50 times less big than the 'Population' one.

Here is an example for 'Single':

##fileformat=VCF4.2
##source=gramtools, version 0.5.0, last commit f471a746037a7f8dc637474fa91ff925e0febf49
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total read depth from gramtools">
##FORMAT=<ID=COV,Number=R,Type=Integer,Description="Number of reads on ref and alt allele(s)">
##FORMAT=<ID=GT_CONF,Number=1,Type=Float,Description="Genotype confidence. Difference in log likelihood of most likely and next most likely genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  site.site_sim.iso.1.subject.simulated_reads.150.30.lab_id.simulated_reads.150.30.seq_reps.1
NC_000962.3     1000    .       GTCTTAATACCCCTGGAATCGCCATCTTGTGCAAGATAGTACAGAGGAGC      GAGCTGGAGACCCGCATCGCCATCTTGCGCAAGAAAGCACAGATGGA .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:26:0,26:1136.36
NC_000962.3     2005    .       ACGGCGCTCGCGTGTTGAACAACTCATCGAGTGATGTGGAAGTGA   AGCGGGGCGCTAGCCACGGGATCGAACTCATCGTGAGGTGAAAGGGC .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:29:0,26:828.16
NC_000962.3     3004    .       GACGAGGGAGGTACACTCGTTGGGGACTAGGCCGGTGAGTGAAAGTGAC       GCCGAGGAAGATCTTGTTGTTGACTATGCCGGTGAACCATTGACG   .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:27:0,27:1120.67
NC_000962.3     4012    .       CGAATATCATTCGGCAATCGGCAGAACCTCTATTGTCGGATTT     CAGATATCGATCGGCAATTGTTAGCAGCTCGGCTGTTGGCGG      .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:25:0,24:924.42
NC_000962.3     5005    .       ATGTCGAAGTTGGGTCAGATCTAGACGCTCAGCTCGCCTAGAAAGCCTCGT     ACGTCGATCGGCCCAGAACAAGGCGCTCCGGTCCCGGCCTGAGAGCCTCGA     .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:19:0,19:1125.08
NC_000962.3     6008    .       GTGCGCATTGTGGTGGTTACTTCGTGAAACACATAATACCTAGAAG  GGTGGCCTGGTGGACTTCGTGAAACACATCAACCGCACCAAGAAC   .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:27:0,26:1033.07
NC_000962.3     7006    .       GATTAACGTCTGCTGGTGGCGGGCTTGCTGAAGGCCGGGAGGCAGTTCAT      GCGACGGTCTGCTGGAGGCGGGGCTGAAGGCCGGGAAGAAGATCAACAAG      .       PASS    SVTYPE=COMPLEX  GT:DP:COV:GT_CONF       1/1:29:0,28:1001.57
iqbal-lab commented 5 years ago

this looks good!

iqbal-lab commented 5 years ago

we should organise a code review process