karel-brinda / ococo

Ococo: the first online variant and consensus caller. Call genomic consensus directly from an unsorted SAM/BAM stream.
https://arxiv.org/abs/1712.01146
MIT License

Allele frequency #16

Closed karel-brinda closed 8 years ago

karel-brinda commented 8 years ago

Add allele frequency to VCF.

Request from @biorelated.

karel-brinda commented 8 years ago

@biorelated: George, could you please check whether the alt_freq branch gives you what you expected (allele frequency)?

george-githinji commented 8 years ago

I just checked it out, compiled it, and ran it in batch mode on the same file, once with default settings and once with -x ococo64. The AF is different in each case. My guess is that the AF is not calculated from the actual values (count/sum of nucleotides at each column) but from the values of the counters. Ideally the AF should remain constant regardless of the choice of counter (-x ococo32, -x ococo64, or default).

With -x ococo64:

465113 15107 . T C 100 PASS AF=0.73;CS=0,33,0,12;COV=45

With default:

465113 15107 . T C 100 PASS AF=1.00;CS=0,5,0,0;COV=5

Same position, different values of AF.

I am also assuming that AF is a decimal number. Or is AF encoded differently in each case?

karel-brinda commented 8 years ago

Now I see the misunderstanding. We use the counters not because they could provide better results than "full values" but because of memory constraints. It is not possible to reconstruct the "full values" from the counters.

You can imagine that -x ococo16 (default setting, 2 bytes per genomic position) provides some kind of approximation of the results (e.g., of AF). Since the representation in memory is lossy, sometimes the answer will not be correct (like on the line from your example). With -x ococo64 you will have complete information with correct AF (because counters will be big enough, no bit shifts will be done and no information will be lost) but it will take 4x more memory (8 bytes per genomic position).
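To illustrate why narrow counters can change the reported AF, here is a minimal Python sketch. It is not Ococo's actual code. It assumes four per-position counters in the order A, C, G, T (consistent with CS=0,33,0,12 in the example above), that AF is the alternative allele's counter divided by the sum of the counters, and that all counters are shifted right by one bit whenever one of them would overflow. The function names and the bit widths (3 and 16) are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of lossy per-position counters; not Ococo's implementation.
NUCLEOTIDES = "ACGT"  # assumed counter order, matching CS=A,C,G,T


def add_base(counters, base, bits):
    """Increment the counter for `base`; halve all counters first if it is full."""
    i = NUCLEOTIDES.index(base)
    if counters[i] == (1 << bits) - 1:
        # The counter is saturated: right-shift every counter by one bit.
        # Rough proportions survive, but the low bits are lost for good.
        for j in range(len(counters)):
            counters[j] >>= 1
    counters[i] += 1


def count_stream(stream, bits):
    """Feed a sequence of observed bases into counters that are `bits` wide."""
    counters = [0, 0, 0, 0]
    for base in stream:
        add_base(counters, base, bits)
    return counters


def allele_frequency(counters, base):
    """AF as counter / sum of counters (assumed definition)."""
    total = sum(counters)
    return counters[NUCLEOTIDES.index(base)] / total if total else 0.0


if __name__ == "__main__":
    # 33 observations of C and 12 of T at one position, as in the example.
    stream = "CCCT" * 11 + "T"
    for bits in (16, 3):
        counters = count_stream(stream, bits)
        print(bits, counters, f"AF={allele_frequency(counters, 'C'):.2f}")
```

With wide counters the sketch keeps the exact counts (33 and 12, so AF = 33/45 ≈ 0.73). With 3-bit counters the stored values can never exceed 7, so both the coverage and the AF derived from them are only approximations, and the result depends on the order in which the reads arrive.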

I will try to make this clear in the FAQ.

george-githinji commented 8 years ago

Many thanks! It feels like a balance between memory consumption and what can be achieved in reasonable time. Did you create this tool to run on "standard" desktops? I can imagine that on a cluster with enough memory, what we are talking about would not be a problem. But that may be for the future, depending on feedback from the community and on the time you have. :) Cheers.