lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
35 stars 2 forks source link

filter-sample-max-read-depth options #36

Open ribeirots opened 1 month ago

ribeirots commented 1 month ago

Would it be possible to have the max read depth be a percentile? For example, set it to 0.98 to exclude reads with depth in the top 2% highest coverage. Perhaps, if the value passed to the parameter is between 0 and 1 it could take it as percentile by default. Alternatively, could I pass different integers for each sample, similar to how I can pass different pool sizes? This way I could calculate the percentile myself before running the program.

lczech commented 1 month ago

Hi @ribeirots,

thanks for the suggestion - let's see if I am getting this correctly:

Would it be possible to have the max read depth be a percentile? For example, set it to 0.98 to exclude reads with depth in the top 2% highest coverage. Perhaps, if the value passed to the parameter is between 0 and 1 it could take it as percentile by default.

I am not sure that I understand exactly what you mean by: "exclude reads with depth in the top 2% highest coverage" - top 2% of what exactly? As far as I can make sense of this, you would like to have a threshold per sample, across all it's positions. This would require a complete reading of the data first to figure out the max read depth per sample, then set a 98% threshold based on that, and then reading the data again to apply it and compute the actual thing we want based on that filter. That double reading seems like a rather large downside, but I can see this generally being a useful way to specify filters.

Alternatively, could I pass different integers for each sample, similar to how I can pass different pool sizes? This way I could calculate the percentile myself before running the program.

That is definitely more efficient, as it would not require two passes through the data. Well, yes, once initially to calculate the percentile for each sample. But it would not require two passes every time then if you need to run grenedalf again, for instance to experiment with the effect of other settings. So that definitely sounds interesting, and might also be applicable to the other per-sample filter settings. I can see a bit of a potential for accidental issues with this as the filtering would not be consistent between samples though. Maybe @jeff_spence has some thoughts on this?

Please let me know if this fits with what you had in mind. Unfortunately, I am a bit short on time for working on grenedalf at the moment, but we can keep this issue open for when I get back to working on it. So, not sure that this will be available any time soon. In the meantime, to achieve the same, you can pre-filter your data manually. For instance, you can use the frequency command to get the per-sample read depth, and then run the sync command for each sample to apply a max threshold per sample.

Hope that helps, cheers Lucas

ribeirots commented 1 month ago

Thank you, Lucas. Yes, I think you understood my suggestion correctly. It is a feature that is implemented in PoPoolation, too - that is where I first saw this kind of filter. I am doing the manual pre-filter to achieve this filter for now.

lczech commented 1 month ago

Hi @ribeirots,

all right, just checked how PoPoolation2 is doing this - they indeed read the whole input data just to get those thresholds to get the value for a percentage of --max-coverage. As said, that is rather wasteful when having to process large amounts of files multiple times, so I think the option to allow specifying separate values for each sample, along with a command to compute this.

Also, PoPoolation2 has a percentage option for the --min-count, but that seems to operate on a per-position basis, so it would not need two passes through the data. The --min-coverage however is always applied for all samples, and without percentages. I am not quite sure if that is a statistical choice to achieve consistency though.

Let's see. Right now I am a bit short on time, but I'll hopefully be able to implement this eventually. It seems useful!

Thanks again Lucas