schneebergerlab / AMPRIL-genomes

scripts for the project of seven thaliana genomes assembly
35 stars 18 forks source link

Synteny diversity calculation? #10

Open markcharder opened 2 years ago

markcharder commented 2 years ago

Hi Wen-Biao,

Thanks for providing the code for your recent publication. Good work on this.

I just have a query about the calculation of synteny diversity.

In the publication, it is defined as: image

It looks like the script 'calculate.syn.diversity.pl' takes all pairwise comparisons and sums the number of times a site is non-synonymous then divides this sum by the total number of comparisons. Is this correct?

If i and j from the equation refer to all pairwise comparisons and pi_ij is either a 1 or a 0, or the fraction of 1s over all possible pairwise comparisons, this is not quite the same as what the script actually does.

For example, working with 11 genomes, we would have 55 pairwise comparisons. Let's say 20/55 comparisons evaluate to 1 as the sites are syntenic. The fraction of syntenic comparisons would be 20/55. This seems to be the number that 'calculate.syn.diversity.pl' gives us. However, this is not what one would get using the algebra in the equation.

For example: Pi=0 for(i in 1:10){for(j in (i+1):11){Pi=Pi+1/11*(1/11)*(20/55)}}

Would set Pi to about 0.17, whereas 20/55 is actually 0.36. If you keep 20/55 as the fraction of non-syntenic sites among all pairwise comparisons then perform the same analysis for all isolates including self-comparisons, for example:

Pi=0 for(i in 1:11){for(j in 1:11){Pi=Pi+1/11*(1/11)*(20/55)}}

Then Pi will be set to 0.36, which is the same as just outputting 20/55. Is this what the equation is describing? If so, one could omit the summation of x_ij and just say that this statistic is the fraction of non-syntenic sites among all pairwise comparisons. It would be a lot simpler.

Please can you provide a few more details on synteny diversity and how calculate.syn.diversity.pl works? Is my interpretation of this script correct?

Best wishes, Mark

markcharder commented 2 years ago

As an addendum to the above issue, I think I have worked out what is going on.

The sum of x_ij a constant always equates to 1 the constant. So for example:

4 sequences = 16 comparisons = sum_1..16(1/4 1/4 Pi_ij) = (1/4 1/4 Pi_ij) 16 = (1/16 Pi_ij) 16 = 1 Pi_ij

And calculate.syn.diversity.pl seems to calculate Pi_ij, which is:

non-syntenic sites / total pairwise comparisons (excluding self and backwards comparisons).

This is quite different from the original definition of nucleotide diversity, which multiplies the frequency of sequence i by the frequency of sequence j by the fraction of differences between them (either 1 or zero at a single site) and sums over this - the same as taking the number of differences / the total comparisons including self and backward comparisons

Just wondering if you can confirm this is correct? I like to have a firm definition of the method I am using before applying it.

Also, sorry for all of the questions and clogging up the issues page - I hope you don't mind!

Best wishes, Mark