karel-brinda closed this 8 years ago
@biorelated: George, could you please check whether the alt_freq branch does what you expected (allele frequency)?
I just checked it out, compiled it, and ran it on the same file in batch mode, once with default settings and once with -x ococo64. The AF is different in each case. My guess is that the AF is not calculated from the actual values (count/sum of nucleotides at each column) but from the values of the counters. Ideally the AF should remain constant regardless of the choice of counter (-x ococo32, -x ococo64, or default).
with -x ococo64
465113 15107 . T C 100 PASS AF=0.73;CS=0,33,0,12;COV=45
with default
465113 15107 . T C 100 PASS AF=1.00;CS=0,5,0,0;COV=5
Same position, different values of AF.
I am also assuming a decimal number system for AF, or is AF encoded differently in each case?
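For reference, both reported AF values are consistent with dividing the ALT allele's count by the sum of the CS counts. The snippet below is only a sketch of that arithmetic: the function name is made up, and the A,C,G,T ordering of the CS field is an assumption inferred from the two lines above, not something confirmed by the tool's documentation.

```python
def allele_frequency_from_cs(cs_field, alt_base, order="ACGT"):
    """Return count(alt_base) / sum(all counts) for a CS=... value.

    `order` is an ASSUMPTION about how the four CS counts are laid out.
    """
    counts = [int(value) for value in cs_field.split(",")]
    total = sum(counts)
    if total == 0:
        # No coverage at this position, so no frequency can be computed.
        return 0.0
    return counts[order.index(alt_base)] / total


# The two records from this thread (ALT allele is C in both).
print(f"{allele_frequency_from_cs('0,33,0,12', 'C'):.2f}")
print(f"{allele_frequency_from_cs('0,5,0,0', 'C'):.2f}")
```

Under that assumption, AF is computed the same way in both runs; what differs is the counter content it is computed from.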
Now I see the misunderstanding. We use the counters not because they could provide better results than "full values" but because of memory constraints. It is not possible to reconstruct the "full values" from the counters.
You can think of -x ococo16 (the default setting, 2 bytes per genomic position) as providing an approximation of the results (e.g., of AF). Since the representation in memory is lossy, the answer will sometimes be incorrect (as on the line from your example). With -x ococo64 you will have complete information and a correct AF (because the counters will be big enough, no bit shifts will be done, and no information will be lost), but it will take 4x more memory (8 bytes per genomic position).
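To illustrate why small counters lose information, here is a minimal sketch; it is not ococo's actual code. It assumes four per-position counters of a hypothetical width `bits`, and it assumes that when a counter is already at its maximum, all four counters are halved with a right shift before the increment. The function names, the 3-bit width, and the read order (12 T observations followed by 33 C observations) are invented for illustration.

```python
def add_observation(counters, index, bits):
    """Count one nucleotide; halve all counters first if this one is full.

    The shift-on-overflow rule is an ASSUMPTION made for this sketch.
    """
    limit = (1 << bits) - 1
    updated = list(counters)
    if updated[index] == limit:
        # Right shift keeps rough proportions but drops the low bits.
        updated = [count >> 1 for count in updated]
    updated[index] += 1
    return updated


def pile_up(observations, bits):
    """Feed a sequence of nucleotide indices through the counters."""
    counters = [0, 0, 0, 0]  # assumed order: A, C, G, T
    for index in observations:
        counters = add_observation(counters, index, bits)
    return counters


def allele_frequency(counters, alt_index):
    total = sum(counters)
    return counters[alt_index] / total if total else 0.0


C, T = 1, 3
observations = [T] * 12 + [C] * 33  # hypothetical read order

wide = pile_up(observations, bits=16)   # large enough, never shifts
narrow = pile_up(observations, bits=3)  # saturates at 7, shifts often

print(wide, round(allele_frequency(wide, C), 2))
print(narrow, round(allele_frequency(narrow, C), 2))
```

With wide counters the sketch keeps the exact counts 0,33,0,12. With the 3-bit counters and this particular ordering, the T count is shifted down to zero and the result is 0,5,0,0, so the AF becomes 1.00. A different read order would give different narrow counters, which is exactly why the full values cannot be reconstructed afterwards.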
I will try to make this clear in the FAQ.
Many thanks! It feels like a balance between memory consumption/allocation and what can be achieved in reasonable time. Did you create this tool to run on "standard" desktops? I can imagine that on a cluster with enough memory, what we are talking about would not be a problem. But that may be for the future, depending on feedback from the community and the time you have. :) Cheers.
Add allele frequency to VCF.
Request from @biorelated.