A small doubt - Githubissues

seoanezonjic commented 3 years ago

I'm using your program and it is a great software. I've tested your program and I would like to suggest that the attribute data file (with freq, detected SNP, allele, etc) would be in json format, that it is easily parseable. Respect of this file, I have the following question. In the output, I have one haplotype that has not SNP and I've assumed that this haplotype is the master/reference sequence. With this assumption, I think that this haplotype was the same sequence that was used in the read alignment step. I've checked this in my samples, aligning the haplotypes against my reference sequence and I've found that the haplotype given by clique-snv with no detected snv is different from my reference sequence. What means this? The virus population has fixed these changes? Could you add some processing step to detect this kind of changes?

Thank you in advance Pedro Seoane

vtsyvina commented 3 years ago

Hello, Pedro.

The SNPs are given with respect to the consensus of the sample, not the reference, since the algorithm does not have the reference. If the major haplotype has, let's say T at position 512 with 70% of reads supporting it, while the reference has C, the C at this position in other haplotypes will be reported as a SNP. Again, this is done because we have no knowledge of the reference from the sam file.

Regarding json format - I'll work on it soon.

Best, Slava

seoanezonjic commented 3 years ago

Thank you very much @vtsyvina for your quick answer Then, if I understand correctly your answer, the frequencies given by clique-snv are calculated taking into account the number of haplotypes generated by your algorithm (obviously, you use the read assignation to each haplotype to quantify the relative abundance). But my concern is that this ignores the reference sequence (reads that don' have any change) and I cannot know the frequency of the reference sequence that it's an haplotype of the virus population itself. I'll illustrate with an example, we infect a cell with a virus cloned sequence so all the haplotypes are the same. This sequences replicate with error into the cell and we have the following count (using 100 as total count of haplotypes):

25 don't have changes, so they are the reference sequence (freq 0.25)
20 have changes in positions 5, 100 and 190 (freq 0.20, if only haplotypes not reference: 0.26)
35 have changes in positions 5, 110 and 250 (freq 0.35, , if only haplotypes not reference: 0.46)
20 have changes in positions 100 and 250 (freq 0.20, if only haplotypes not reference: 0.26)

If the sequences that are not the reference sequence are discarded for the haplotype frequency, these frequencies are overestimated and discard one segment of the population that contributes to the virus fitness. There is some way to take into account the reads with no changes to estimate the frequency of the reference sequence?

vtsyvina commented 3 years ago

If the reference sequence is not dominant in the sample(let's say some haplotype got ~50% frequency) but still present the tool should report it as just another haplotype. Let's call reference haplotype A, then new dominant haplotype B and just one another C. For simplicity haplotype A consists of all A bases, if B and C has mutations they will be A>C. Now, B has mutations in positions 5,100, 190, then the consensus haplotype will be AAAACAAA....AC(100)A.....AC(190)A..., then the algorithm should be able to detect A as another haplotype with SNPs AAA[5,100,190] compared to the consensus haplotype. And if C has mutations in 10, 100, 190, 200 then it will be reported as having two mutations CC[10,200] (since positions 100, 190 are equal to the consensus).

You should understand the SNPs in output file relative to the consensus, not reference. Some samples may have 5 haplotypes of equal frequency, then the "consensus" haplotype doesn't even exist.

The frequencies are estimated using EM algorithm, it doesn't rely just on read counts(since different haplotypes have different number of SNPs and different distribution of them the raw read count doesn't give any good estimation).

My suggestion would be to analyze obtained sequences themselves, not just SNPs that the tool provides. That human readable file was not intended as variant calling information. It might be helpful for a very quick look at the results and more useful if we know that the consensus haplotype exists in the sample.

seoanezonjic commented 3 years ago

Hi @vtsyvina Your last answer is very clear, and I think that I understood the concept. Only to confirm, if there is one haplotype generated by clique-snv with an empty list of SNV, is the consensus one (the dominant haplotype). If I align all the haplotypes against the reference and one of them has the same sequence that the reference, this haplotype it's the reference. So, the frequency calculated for this haplotype represents the abundance of the reference sequence.

vtsyvina commented 3 years ago

Yes, it will be consensus haplotype for the sample. When there is only one haplotype reported it can be for several reasons:

CliqueSNV actually didn't find anything other than the consensus
The sample is too diverse and the program cannot execute in reasonable time, so it just outputs consensus

"So, the frequency calculated for this haplotype represents the abundance of the reference sequence." - yes

vtsyvina commented 3 years ago

The version 1.5.5 switched to JSON format. You can check it out

vtsyvina / CliqueSNV

A small doubt #8