karel-brinda / ococo

Ococo: the first online variant and consensus caller. Call genomic consensus directly from an unsorted SAM/BAM stream.
https://arxiv.org/abs/1712.01146
MIT License
47 stars 3 forks source link

How do you interpret the output? #14

Closed george-githinji closed 8 years ago

george-githinji commented 8 years ago

With -V and real-time modes, how do you interpret the VCF file? What is SUM and COV really mean? and what is the relevance of CS info fields? There are also multiple values per genomic position, how do you interpret this?


9465113 14942   .       T       A       100     PASS    CS=6,0,0,4;SUM=10;COV=10
9465113 14942   .       A       T       100     PASS    CS=2,0,0,4;SUM=6;COV=6
9465113 14942   .       T       A       100     PASS    CS=5,0,0,3;SUM=8;COV=8
9465113 14942   .       A       T       100     PASS    CS=4,0,0,6;SUM=10;COV=10
9465113 14942   .       T       A       100     PASS    CS=6,0,0,4;SUM=10;COV=10
9465113 14942   .       A       T       100     PASS    CS=4,0,0,6;SUM=10;COV=10

On the above example, there are multiple VCF rows and variable info fields values. How would I interpret this?

Thank you

karel-brinda commented 8 years ago

Hi George,

each line corresponds to an update of the internal consensus. CS are values of A,C,G,T counters (see the paper for more info), SUM is their sum. COV is "truncated" coverage (it can be different from SUM when counters have non-zero initial values).

In your example, the consensus at this position was updated several times T => A => T => A ... It is probably a heterozygous SNP position and OCOCO is oscillating between both SNPs (A/T).

george-githinji commented 8 years ago

Thank you very much for the clarification. In the above example we have 6 oscillations 3 are T=>A and 3 are A=>T. From a user point of view, how do I tell with confidence that this position was a T=>A or an A=>T? I have looked through the paper. There are a lot of details but requires a careful read. It is not obvious what is a counter and the effect from the user's perspective of using different counters and how it affects the results and interpretation.

Sorry to bother but given the example above, do you mind to explain it for me? I will definitely read through the paper in more detail but I want to assess if I can use this caller for a minority variant pipeline and how. I liked the idea of updating the reference based on the consensus. Many thanks.

karel-brinda commented 8 years ago

Hello, thank you for this feedback. I will write some simple tutorial and append it to the readme (with basic use cases). I think that, in your situation, batch mode could be more appropriate. Could you test it and tell me if this is the output you want to see?

Concerning the mechanism behind, OCOCO uses for every position 4 small counters, which occupy just several bits (but you can adjust this size by the -x parameter). After processing every new read, it checks if the majority nucleotide changed at some position. If yes, an update is reported. OCOCO currently supposes haploid genomes so the results may be unstable at heterozygous positions (may report many updates between both SNPs present there). If you want to use the caller in 'offline style' (reporting all updates after processing all reads), the batch mode is appropriate.

george-githinji commented 8 years ago

Thank you! I have used the batch mode and as expected there are fewer rows with one row per position. Position 14942 in the earlier example has been omitted in the batch mode results.

What is the meaning of COV? There is a disclaimer that "correct if no shift has been performed and reference nucleotide was unambiguous", so how should this be interpreted? The values reported are way lower than the actual read coverage at this position. I still don't understand your definition of a counter. Maybe you could break it down further. I want to imagine it is the number of bases that support either the alternative or reference reads. and how come the value of SUM is equal to COV across all instances. Are there situation they could be different? and if not why the redundancy? Is it possible to report the raw read coverage at each position and an allele frequency?

karel-brinda commented 8 years ago

A counter is a variable for counting nucleotides. Each position has 4 counters. To be memory efficient, these counters are small and cannot represent higher values than 7 (in default setting). When you use -x ococo64, they can represent numbers from [0,32767]. When you use -x ococo32, they can represent numbers from [0,127].

COV = sum of all four counters at a given genomic position. When a counter gets saturated, a shift of all counters at the position must be performed (so all values at the position are divided by 2). If you use counters big enough (-x ococo64), COV will be really the coverage.

karel-brinda commented 8 years ago

COV and SUM are different only in the case when you use --ref-weight with a nonzero number.

Allele frequency can be obtained in the following way. When you have, e.g., CS=6,0,0,4, allele frequency for A is 6/10 and 4/10 for T. I can add it to the output if you want (as another quadruple of numbers).

george-githinji commented 8 years ago

Thank you once again for the explanation. It is very helpful. I think it would be good to add the allele frequency to the output! It will be easier to parse rather rather than computing separately. Thank you very very much! I will send you an email in case of more queries or discussions.

george-githinji commented 8 years ago

In regard to the previous query and your explanation, I feel like it would be good to provide a flag to report the exact coverage. This way, there is no need to place a disclaimer when a shift is made, otherwise if a shift is made, COV is inaccurate and it makes me wonder what is its relevance to the user if it is only required for computing some internal value/ function. Another way is to provide a function that maps the counter value and coverage to their exact or absolute values. A variant is only relevant when the read depth at that position is good (or known) and the base qualities are also high. That does not necessary mean it is always true because PCR errors could have resulted from the amplification process.

karel-brinda commented 8 years ago

The original purpose of OCOCO are updates in dynamic read mapping. In such a case, only very limited memory can be available. This is the reason why those tricks with small counters are used (to update the reference in dynamic mapping, nucleotides frequencies provide enough information, we don't need exact coverage). If you want to have exact coverage, use -x ococo64 (it will be exact if coverage does not exceed 32767).

Concerning the flag whether coverage is exact, I will think about it (I can use a Bloom filter for it). I will create a new issue for it.

Thank you for all your remarks and comments.

george-githinji commented 8 years ago

Thank you for a helpful tool!

karel-brinda commented 8 years ago

Please, could just send me an e-mail with a short description of how you use OCOCO? It would be very helpful for me (to understand in which all situations OCOCO can help). My e-mail address is karel.brinda@gmail.com.