lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0

Masking sample-wise #25

Closed capoony closed 3 months ago

capoony commented 4 months ago

Hi Lucas,

me again. Another important feature that would be very useful is sample-specific masking. Global masking is useful for masking TEs etc. However, individual libraries can differ in read depth, which may require more specific masking for individual samples.

In our case, we have an individual BED/MASK file (in FASTA format) for each sample. Currently, I need to split the input by sample and run grenedalf for each sample separately, which is quite an effort for >700 samples.

Is there a more elegant way to do that, e.g., by first reading the BED files for each sample and creating a matrix of window-wise masks for each sample, which can then be used to calculate averages?

Cheers, Martin

lczech commented 3 months ago

Hi Martin,

thanks for your patience, now getting back to working on grenedalf.

Great suggestion, and I'll get to implementing this soon. I think a potential solution could be as follows: The masking as it is right now is merely another filter, where masked positions are not used in the downstream statistics computation. Any non-masked position, however, also undergoes any additional filters first (numerical etc., whatever the user provided), and whatever remains after that is used for the statistic. That logic can easily be extended to per-sample masking, by simply having the sample mask do the same as the current global mask, but on a per-sample basis. Any position in a sample that the sample mask tells us not to use is filtered out, and any position that is not masked will then undergo all additional filters. Again, whatever remains after that will be used for the statistic. I think that would solve this feature request, right?
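To make that filter order concrete, here is a minimal sketch in Python (not grenedalf's actual C++ code). All names are hypothetical, and a simple minimum read depth stands in for "whatever numerical filters the user provided":

```python
import numpy as np

def positions_to_use(depth, global_mask, sample_masks, min_depth):
    """Return a boolean (n_samples, n_positions) array, True = position is used.

    depth        : (n_samples, n_positions) read depth per sample and position
    global_mask  : (n_positions,) bool, True = masked for all samples
    sample_masks : (n_samples, n_positions) bool, True = masked for that sample
    min_depth    : hypothetical stand-in for the user's numerical filters
    """
    depth = np.asarray(depth)
    # Step 1: the global mask removes a position from every sample.
    keep = ~np.asarray(global_mask, dtype=bool)[np.newaxis, :]
    # Step 2: each sample mask removes positions from that sample only.
    keep = keep & ~np.asarray(sample_masks, dtype=bool)
    # Step 3: whatever is not masked still has to pass the numerical filters.
    keep = keep & (depth >= min_depth)
    return keep

# Toy example: 2 samples, 4 positions.
depth = [[10, 10, 2, 10],
         [10, 10, 10, 10]]
global_mask = [True, False, False, False]
sample_masks = [[False, True, False, False],
                [False, False, False, False]]
keep = positions_to_use(depth, global_mask, sample_masks, min_depth=5)
```

In the toy example, position 0 is dropped everywhere by the global mask, position 1 is dropped only in the first sample by its sample mask, and position 2 is dropped only in the first sample by the depth filter.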

As for how to provide that: How about a simple two-column table file, mapping from sample name to mask file? That seems a bit easier than having users construct a matrix from their masks first.
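As a rough illustration of such a table, the sketch below reads a two-column file into a mapping from sample name to mask file. The details are assumptions made for the example only (tab as separator, `#` for comment lines, the file paths) and say nothing about the format grenedalf ends up accepting:

```python
import csv
import io

def read_sample_mask_table(handle):
    """Parse a two-column table: sample name <TAB> path to that sample's mask file."""
    mapping = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip blank lines and (assumed) comment lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        sample, mask_path = row[0].strip(), row[1].strip()
        if sample in mapping:
            raise ValueError(f"line {line_no}: duplicate sample name '{sample}'")
        mapping[sample] = mask_path
    return mapping

# Hypothetical table content; sample names and paths are made up.
table = io.StringIO("# sample\tmask\nS1\tmasks/S1.bed\n\nS2\tmasks/S2.bed\n")
sample_to_mask = read_sample_mask_table(table)
```

Rejecting duplicate sample names and malformed rows up front seems preferable to silently using the wrong mask when there are several hundred samples.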

Lastly, I think all of that is independent of your other request (https://github.com/lczech/grenedalf/issues/24), which is about the window averaging. So, any masking per sample can be applied here first, and then the window average will be done on a global basis, so that the same denominator is used to get the window average for all samples. Or do you think a per-sample denominator is needed as well? That would make it considerably more complex though, as in the case of FST, that would need to be a per-sample-pair denominator instead of a per-sample one.
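To illustrate the difference between the three kinds of denominators, here is a small Python sketch. It is only an assumption for the sake of the example that the global denominator is the number of positions in the window that are not globally masked; the names are hypothetical as well:

```python
from itertools import combinations
import numpy as np

def window_denominators(keep, global_mask):
    """Compare candidate denominators for one window.

    keep        : (n_samples, n_positions) bool, True = position used in that sample
    global_mask : (n_positions,) bool, True = masked for all samples
    """
    keep = np.asarray(keep, dtype=bool)
    # One shared denominator for all samples (assumed definition, see above).
    global_denominator = int((~np.asarray(global_mask, dtype=bool)).sum())
    # One denominator per sample: positions that survived in that sample.
    per_sample = keep.sum(axis=1).tolist()
    # One denominator per sample pair, as a pairwise statistic such as FST
    # would need: positions that survived in both samples of the pair.
    per_pair = {
        (i, j): int((keep[i] & keep[j]).sum())
        for i, j in combinations(range(keep.shape[0]), 2)
    }
    return global_denominator, per_sample, per_pair

# Toy window: 3 samples, 4 positions, position 0 globally masked.
keep = [[False, False, False, True],
        [False, True, True, True],
        [False, True, False, True]]
global_mask = [True, False, False, False]
global_denominator, per_sample, per_pair = window_denominators(keep, global_mask)
```

With n samples there are n per-sample denominators but n(n-1)/2 per-pair ones, which is where the extra complexity for FST comes from.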

So long, Lucas

lczech commented 3 months ago

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers, Lucas