diversity calculations (pi) seem to calculate based on windows in an unexpected way

TylerAudet commented 9 months ago

Hi Lucas,

I've noticed my theta pi numbers were far, far higher than I would anticipate. I'm calculating based on sliding windows of 10000 positions. I found one high pi area to look at what was going on. That region looks as follows:

2L      5776    C       0:0:32:192:0:0
2L      6353    C       0:183:30:0:0:0
2L      6921    G       0:192:0:27:0:0
2L      7542    G       0:169:0:45:0:0
2L      7801    C       0:161:36:0:0:0
2L      8667    G       0:0:29:185:0:0
2L      8774    T       0:27:172:0:0:0

So in a 10000bp window I have 7 segregating sites with reasonably good and consistent coverage. If we calculate Per site nucleotide diversity (based on the naive calculation of per site expected heterozygosity) would be 0.00016. However, I get 0.269 which is reasonably close to the value expected based on only the 7 segregating sites (~0.23). So it seems the windows are not behaving as I would expect? It is calculating based on the just the bi-allelic sites, rather than the entire window. Is there a way to set this to acknowledge the rest of the window that I missed?

Thank you, Tyler

lczech commented 8 months ago

Hi @TylerAudet,

very sorry for the very late answer here, somehow missed to reply to a few issues here completely.

So, the way this currently works is that we stream through the input, apply all filters as provided, and then continue to the statistics computation with whatever positions remain after filtering. Let C be the that set of all positions in your input that were present and not filtered out by any filter setting.

Then, the output table for the diversity command contains two columns for that, theta_pi_abs and theta_pi_rel. The first one is absolute, and simply the sum of all pi for positions in C. The second column is relative, and is computed as the absolute value divided by the number of positions in C (SNPs and non-SNPs - but excluding all positions that were either filtered or not present in your input in the first place).

With that in mind, do your numbers make sense? From the information you provided, I cannot tell whether your data only contains those 7 positions in your exemplary window, and no data in any other position, or if you just extracted those positions for simplicity here. Either way, you can check if the numbers work out that way :-)

If you need a different way of computing relative pi, for now I'd recommend to use the absolute value instead, and compute the relative one according to your needs yourself.

That being said, this way of normalizing the value has (rightfully) been criticized (see several of the other open issues here), and will change in the near future. I am in the middle of implementing this in a way where all information is kept all the way through the computation, instead of just removing data in the filtering. So, with the next release, this will change and more options will be available for this.

Hope that helps for now, and please let me know what you think, if that makes sense, and what other suggestions you have to make this proper. Now is a good time for suggestions like that, as I am working on this at the moment!

Cheers and so long Lucas

lczech commented 5 months ago

Hi @TylerAudet,

took me a while, but the latest release grenedalf v0.5.0 now has a completely redesigned approach for calculating the diversity statistics. We now offer --window-average-policy (see here and here) to help with the issue that you observed, along with mask files to handle invariant vs low quality sites, and a re-designed filtering, see here.

Please have a look - I'd recommend to read through the General usage pages on the wiki. I think, with all of those, your issue should be fixed, and hence, I'm closing the issue now. However, if your issue persists, please let me know by re-opening this, or starting a new issue!

Cheers and so long Lucas

lczech / grenedalf

diversity calculations (pi) seem to calculate based on windows in an unexpected way #17