luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

Question about genotype call in GVCF #123

Closed bredelings closed 4 years ago

bredelings commented 4 years ago

Describe the bug I see some output in the GVCF that seems weird to me, but I don't know enough about VCF to know if this is actually unexpected.

Using the individual caller with --refcall POSITIONAL -P 1 I'm seeing lines like this:

NODE_1_length_502414_cov_40.174285      238754  .       A       <*>     194.56  PASS    AC=0;AN=1;DP=66;MQ=56;NS=1      GT:GQ:DP:MQ:FT  0:195:66:56:PASS
NODE_1_length_502414_cov_40.174285      238755  .       A       <*>     193.97  PASS    AC=0;AN=1;DP=68;MQ=56;NS=1      GT:GQ:DP:MQ:FT  0:194:68:56:PASS
NODE_1_length_502414_cov_40.174285      238755  .       AT      A       567.05  PASS    AC=1;AN=1;DP=68;MQ=56;NS=1      GT:GQ:DP:MQ:PS:PQ:FT    1:567:68:56:238755:100:PASS
NODE_1_length_502414_cov_40.174285      238757  .       A       <*>     197.45  PASS    AC=0;AN=1;DP=69;MQ=56;NS=1      GT:GQ:DP:MQ:FT  0:197:69:56:PASS
NODE_1_length_502414_cov_40.174285      238758  .       A       <*>     197.4   PASS    AC=0;AN=1;DP=69;MQ=56;NS=1      GT:GQ:DP:MQ:FT  0:197:69:56:PASS

The line I don't understand is the middle one with the variant. It looks to me like this is a variant supported by 1 read (AC=1) out of 68 (DP=68), and yet the ALT genotype is called (GT=1). I would expect the REF genotype to be called (GT=0). But probably I'm just misunderstanding how the format works.

It looks to me like the VCF outputs are calling a variant every time a single read differs from all the other reads. However, I confess that I have not actually loaded this up in IGV yet.

Previously I had left off the -P 1 and i was getting genotypes like 1|0 for all the variant sites, also with AC=1 or (I think) AC=2. However, in the current file, I don't see any cases of AC=2, which seems odd.

Version

$ octopus --version
octopus version 0.7.0 (develop ba99b500)
Target: x86_64 Linux 2.6.32-642.13.1.el6.x86_64
SIMD extension: SSE2
Compiler: GNU 9.1.1
Boost: 1_73

Command line to run octopus:

$ octopus -R contigs-1k+.fasta -I contigs-1k+.bam -o contigs-1k+.g.vcf.gz --threads 32 --refcall POSITIONAL -P 1
dancooke commented 4 years ago

The INFO/AC annotation refers to the number of called alleles (i.e. summing over GT entries, for each allele). For single sample haploid calls you'll only ever see AC=1 for variant sites (for diploid it would be max AC=2 with GT=1|1). This annotation is mostly used for population analysis where the population allele frequency AF=AC/AN is useful. What you're looking for is AD (Allele Depth). Octopus doesn't output this by default, but you can request it using the --annotations option, so long as it's an active filtering measure (see this discussion).

bredelings commented 4 years ago

Thank you very much. I ran an analysis with --annotations AD and put AD<1 into the filter. However,

Example:

NODE_2_length_375716_cov_40.650521      207843  .       A       AT,ATT  21.41   PASS    AC=1,1;AN=2;DP=69;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:AD:FT 1|2:20:69:59:207843:100:23:PASS

I'm guessing that your AD number is summing over the ALT alleles. Would it be possible to output a number for each allele?

dancooke commented 4 years ago

As of 2644fe1ee91d1019231dcb9f01ed8ad1f3bd5b39, Octopus now reports relevant annotations, including AD, on a per-allele basis as expected.