Dxy calcuations - Githubissues

bioinfowheat commented 1 year ago

Dear Lucas,

great to see a revision of PoolSeq software for the next round of high throughput analyses. Exciting times ahead.

I know that you have already done a lot of work, but one thing that would help advance the community is the ability to generate Dxy per window, as a measure of divergence between populations.

why?

There is a lot of existing empirical and theoretical work that focuses on when Fst vs. Dxy are informative, and thus, it would be nice to have a framework where Fst and Dxy could both be estimated from PoolSeq. This was a limitation of the Popoolation2 package, that I think held back PoolSeq analyses from more commonly combining Fst and Dxy in their analyses.

Currently, in the individual based VCF world, this is done perhaps best via PiXY, which overcomes existing tools in accounting for missing data in most VCF based approaches for Pi and Fst analyses. https://pixy.readthedocs.io/en/latest/

Hopefully you can do something similar?

lczech commented 1 year ago

Dear @bioinfowheat,

thanks for the positive feedback, happy to hear that the tools is useful :-)

On the dev branch, I have now implemented an option in the fst command, called --write-pi-tables, which prints per-window values of pi within/between/total, in addition to the final fst values. In our grenedalf preprint, we explain these in the equations supplement. See equations 35-37 there; these values are now printed as well when using the new option. (For future reference: this is the equations document version from 2023-06-19)

As far as we can tell, pi between, as described there, is exactly Dxy. We currently scale it by number of variant positions in each window, but as that number is also printed in the same table, this scaling can easily be un-done, if needed (e.g., to scale by window size instead). Please let me know if that is indeed correct, or if you had something else in mind there.

One thing that we are aware of that is not properly implemented at the moment is the proper treatment of missing vs invariant sites (pixy describes this here). Currently, we scale by number of variant sites (snps), and hence treat missing and invariant sites the same.

I recently talked with @alanbergland about this as well, and it is on our agenda to look into this. One issue with implementing this properly is that not all file formats would allow us to do this correctly. While sam/bam files contain data for all positions, sync files might not contain invariant sites, so that we cannot distinguish between invariants and missing data... (@capoony et al introduced a gsync variant of the sync format that contains entries at all positions, similar to gvcf). So, of course it could be implemented for the case where the information is available. This is also related to https://github.com/lczech/grenedalf/issues/5, where @capoony suggested a solution that could help here.

Curious to hear your (all) thoughts on this - @MoisesExpositoAlonso, @jeffspence, and I are planning to reach out to the community about this as well in the near future, but maybe you already have some insights here.

So long Lucas

lczech commented 5 months ago

Hey @bioinfowheat,

just wanted to give a quick heads-up about the latest release grenedalf v0.5.0. This now finally implements a re-designed approach to the whole filtering and dealing with missing vs invariant sites, which also allows to properly normalize values per window, see here.

I think, with that, and the above-mentioned way of extracting Dxy from the FST command, you should be able to compute the statistics you need.

Hope that helps, and please let me know if that is what you were looking for! Going to close the issue for now, but feel free to comment, or re-open, or start a new one, if you have more questions or suggestions! Happy to get your feedback on this!

Cheers and all the best Lucas

lczech / grenedalf

Dxy calcuations #7