Suggestion: Include MASK files for correct calculation of per-site averages

capoony commented 1 year ago

Hi Lucas,

me again! I have another suggestion that may be useful as you are restructuring your code right now. One common problem with PoPoolation is, that it does not take into account that not only SNPs but also non-polymorphic positions may not pass heuristic SNP-calling paramters or a posteriori filtering. This commonly leads to an underestimation of per-site averages for any popgen statistics when simply dividing the sum of absolute values by the window-size. To yield correct values, the window-sizes need to be adjusted to account for all sites that do not pass the same criteria (heuristic parameters, removal of low-complexity regions, etc.) as the actual SNPs. We describe this issue in the Methods section of our paper (https://academic.oup.com/mbe/article/37/9/2661/5837682)

One way to pass such information is to either read BED or MASK files. The latter are FASTA-type files, where all positions with a 0 are retained and all positions >0 are ignored. See also the documentation of vcftools

>1
0000011111222...
>2
2222211111000...

Based on this information, the window-size could be adjusted to calculate correct per-site averages. Hope this is helpful and apologies for swamping you with stuff ;-)

Cheers, Martin

lczech commented 1 year ago

Hi Martin,

I have another suggestion

Keep them coming! Happy to get new ideas, even though I cannot always promise delivery times :-)

One common problem with PoPoolation is, that it does not take into account that not only SNPs but also non-polymorphic positions may not pass heuristic SNP-calling paramters or a posteriori filtering. This commonly leads to an underestimation of per-site averages for any popgen statistics when simply dividing the sum of absolute values by the window-size.

If you are talking about the diversity measures in PoPoolation (Theta Pi and Theta Watterson), their relative values are in fact computed using the number of covered sites in the window (see here), and not just using the whole window size. They also output the coverage fraction (number of covered positions in the window). Or did I misunderstand what you mean?

We describe this issue in the Methods section of our paper

You mean, how you obtain the correct denominator for computing relative Theta? Could you please point me to the section you are referring to? I'm not sure that I identified the correct passage about that in the paper.

One way to pass such information is to either read BED or MASK files.

That is a suggestion that @jeffspence also made, and should not be too hard to implement in grenedalf. I'll look into it!

One unsolved issue that I have though is as follows: Currently, grenedalf filters out everything unwanted (such as invariant sites, low coverage sites, etc) based on the input data, in one of the reading steps already. So at the time when the actual statistics are computed in the program workflow, only the sites of interest are processed, and all other information is "lost". As there are also input file formats that do not even contain invariant sites (meaning, for those formats, we do not have any coverage information for those sites anyway), I deemed that an appropriate way to handle the data that works for every file format - only keep what's interesting for downstream. That means though that relative Theta for example will be computed based on window size at the moment - and that indeed can underestimate the value, as @jeffspence has also pointed out before. I have an idea for refactoring the code to retain that information, if needed - but it would take some time, and would not work for all file formats...

Alternatively, the BED or MASK file approach would work (for all file formats). Just to clarify though: The way that you would want that to work is to create that file based on whatever criteria you have for your particular analysis, including to mask out low coverage invariant positions yourself, right? Or would you want that to be on top of keeping track of low-coverage invariant sites within grenedalf - kind of like combining the mask from the file with the "mask" created by keeping track of the number of low coverage positions in each window? That would be more complicated (and still not work for all file formats).

Happy to hear your thoughts on this and thanks again Lucas

lczech commented 1 year ago

Follow up question for @capoony and @jeffspence:

The mask file is easy enough to use for counting positions to be used for computing relative Theta, as the mask file format of vcftools gives information per position. Similarly, I can imagine to use BIM/MAP files of PLINK, where the provided positions are interpreted as "used", and non-provided positions as "masked".

However, formats such as BED, or for that matter GFF/GTF, describe regions, instead of individual positions. How would you suggest then to use those formats to compute relative Theta? I could imagine to use the "score" column of BED, and interpret that as number of non-masked positions per region? Or what did you have in mind?

jeffspence commented 1 year ago

I was thinking something like the --mask option in smc++ (https://github.com/popgenmethods/smcpp), where the regions specified in the .bed file are all of the positions that should be treated as missing. That is, the number of non-masked positions per region would be the number of positions in the region that are not covered by some interval in the .bed file.

lczech commented 5 months ago

Hi @capoony,

finally the (hopefully) proper ways of normalizing windows is now implemented in grenedalf v0.5.0, as you also saw from the email earlier today with @alanbergland. Just wanted to bring this up here, and finally close this issue. If you have any questions, comments, or issues with the new implementation, please let me know!

All the best Lucas

lczech / grenedalf

Suggestion: Include MASK files for correct calculation of per-site averages #5