lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
33 stars 2 forks source link

masking when averaging by window-length #24

Open capoony opened 2 months ago

capoony commented 2 months ago

Hi Lucas,

thanks for the great new features -praticulary the averaging options are very useful. However, I am wondering if you could also implement the possibility to mask when in averaging by window-length. As far as I understand this is not possible yet.

In our case, we have already called SNPs and want to average diversity stats in windows (by window-length) based on a SYNC file with SNPs only. In addition, we have BED files, which contain the regions that should be masked in the genomes.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Thanks, Martin

lczech commented 1 month ago

Hi Martin,

thanks for your patience, finally circling back to working on grenedalf.

To make sure that I am understanding your suggestion correctly: Do you want to be able to provide a mask file that determines the denominator of each window? Such as: In a given window of length l, the mask file has x positions marked as masked out and y positions not masked, with x + y = l. Then, to compute the window average of a statistic, we simply divide the sum of per-site values of the statistic by y. In other words, a mask file that is taken as "ground truth" for the positions that shall and shall not be considered for the window averaging. That does indeed sound useful! I think that could be an additional type of mask, which would then only be used for the window averaging computation.

Would that be a solution to what you are thinking? If not, I am not understanding what exactly you would want to compute there - in that case, could you please clarify with an example?

Cheers Lucas

lczech commented 1 month ago

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers Lucas

lczech commented 1 month ago

Hey Martin,

just to also answer your comments in a bit more detail. I think I might have been a bit quick to close this issue - reopening now, so that we can discuss this.

When I am using --window-average-policy valid-loci in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz I get a lot of missing positions, but (I guess) the correct number of masked sites per window.

Yes, the missing positions are expected there, because when specifying a mask, we assume that this is to be taken as ground truth, and so in order to apply that, we need to fill in all positions for which there is no input as an empty missing position, to which then the mask can be applied. The alternative would be to keep the input without filling in the missing positions, but then the mask might specify positions as unmasked that are not there, and so they would still be ignored, despite the mask telling us that we should not.

Does that make sense? I could also deactivate that the missing positions are filled in when a mask is specified, and instead leave that to the user to decide (there is an option --make-gapless for that already anyway). Let me know what you think :-)

When I am using --window-average-policy window-length in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz I get no (or a few??) missing positions, but also no masked sites per window.

Interesting - you should see missing positions there as well, I think. Can you maybe test with the latest v0.6.0 again, and see if that changed? If not, can you maybe provide a minimal example for me to check?

As for the window average policy, I think that the valid-snps might be what you are looking for. If no other filters are applied, this will exactly take the positions of the mask as valid for the averaging.

Cheers Lucas