lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0

subsampling vs coverage thresholds #16

Closed r-mashoodh closed 5 months ago

r-mashoodh commented 11 months ago

hi,

thanks for developing this -- it looks really useful.

I had more of a conceptual question. The PoPoolation manuals always suggest subsampling pileups to create equal coverage. Is that still necessary/useful here? And if so, how does one decide on the subsampling value, or on the min/max coverage?

cheers and many thanks in advance.

lczech commented 9 months ago

Hi @r-mashoodh,

very sorry for the very late answer here - somehow I completely missed replying to a few issues.

I assume you are referring to the sections "basic-pipeline/subsample-pileup.pl" and "Calculating Tajima's D" of their wiki manual? Hm, I am not entirely sure why they recommend that. Knowing their implementation, it is definitely going to make the computation faster - maybe that's the reason? Computing some of the correction factors for Tajima's D in PoPoolation is slow, so maybe they recommend this to avoid long run times caused by a few very high coverage positions? Generally, I would assume that having more data gives more reliable results. Maybe @jeffspence has some statistical intuition about why subsampling might be beneficial?

If it is about the computation time though, this is something that has been improved by orders of magnitude in grenedalf compared to PoPoolation. So, you can try leaving your high coverage positions in there, and I hope that it's still fast enough for your data. If not, then subsampling will also help in grenedalf, as it avoids spikes in computation time for high coverage positions as well.

Hope that helps for now, and sorry again! Lucas

jeffspence commented 9 months ago

I don't think there are any statistical reasons for downsampling coverage, although I could see filtering out very high coverage sites because they’re likely duplications (or weird in other ways). Otherwise, some of the estimators (e.g., Tajima's D) require computation that scales quadratically in the maximum coverage, so downsampling could help with runtime. But if runtime is not an issue, then I don't see an obvious reason to downsample.
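To make the quadratic scaling concrete (this is only a schematic stand-in, not the exact pool-seq correction that grenedalf or PoPoolation uses): if a per-site correction term of the form

$$a_c = \sum_{m=1}^{c-1} \frac{1}{m}$$

has to be tabulated for every coverage value $c$ up to the maximum coverage $C$, the total work is $\sum_{c=1}^{C} O(c) = O(C^2)$, so a few extremely deep positions can dominate the runtime.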

lczech commented 8 months ago

Hey @r-mashoodh,

just wanted to quickly check in whether the answers above helped you. Also, there is now a new option implemented, currently only on the dev branch, called --subsample-max-coverage, which might be interesting for you. See a more detailed explanation as well as some current caveats here.

Cheers and so long Lucas

r-mashoodh commented 8 months ago

hi @lczech

thanks for the response! I'm pretty puzzled about the pileup subsampling step suggested by PoPoolation, which is why I wanted to ask what you thought about it.

I found the following justification:

"Several population genetic estimators are sensitive to sequencing errors. For example a very low Tajima’s D, usually indicative of a selective sweep, may be, as an artifact, frequently be found in highly covered regions because these regions have just more sequencing errors. To avoid these kinds of biases we recommend to subsample to an uniform coverage."

from this tutorial: https://www.kofler.or.at/bioinformatic/wp-content/uploads/2018/07/pooledAnalysis_part1.pdf

From what you're saying, it sounds like subsampling reduces computation time, but there isn't a real reason to subsample to uniform coverage? I guess this depends in part on how the --filter-sample-min-count flag works?

this is a sample command of mine (pools are sequenced ~80x):

grenedalf diversity --sync-path pops.idf.tef.sync \
--filter-sample-min-count 2 \
--filter-sample-min-coverage 12 \
--window-type chromosomes \
--pool-sizes 40 \
--file-prefix diversity \
--rename-samples-file sample_names.txt \
--measure all

do you have any advice then on setting --filter-sample-min-coverage and/or --filter-sample-max-coverage?

thanks!

PS: Is there somewhere in the documentation that defines the output columns of diversity --measure all, specifically the 'abs' vs 'rel' distinction?

lczech commented 8 months ago

Hi @r-mashoodh,

thanks for finding the source of the justification for subsampling, that is interesting! I wasn't aware of that. Hm, might make sense to do it then, depending on your data and situation. I've just added this justification to the documentation of the new --subsample-max-coverage option. Also, I've fixed a little thing there concerning the N and D counts - please get the latest dev branch if you are planning to work with this option.

From what you're saying, it sounds like subsampling reduces computation time, but there isn't a real reason to subsample to uniform coverage?

It does indeed reduce the time for the diversity estimators a bit, if you have regions with very high coverage (10k or more). Otherwise, it's not super significant, at least with my test data. Your mileage may vary though, so better to test it yourself. But also, yes, according to your linked justification, it seems it might make sense to subsample anyway, if you have regions with spuriously high coverage.
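Just to show what that looks like on a command such as yours (the value 200 here is an arbitrary example cutoff, not a recommendation):

grenedalf diversity --sync-path pops.idf.tef.sync \
--subsample-max-coverage 200 \
--pool-sizes 40 \
--window-type chromosomes \
--measure all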

I guess this depends in part on how the --filter-sample-min-count flag works?

Not sure what you mean - the subsampling is more about the maximum coverage, not the minimum count for a particular base. That option is meant to filter out counts that are likely just sequencing errors. Ah okay, I see: in your quoted text, it says that subsampling helps avoid sequencing errors accidentally reaching the min count. Yes, in that sense, these options interact with each other.

do you have any advice then on setting --filter-sample-min-coverage and/or --filter-sample-max-coverage?

Hm, not more than: it depends... The statistical reasoning and biological interpretation of these filter settings depend on the characteristics of your data, and on what you need to filter for... Not sure that there is a one-size-fits-all recommendation here. Maybe @jeffspence has some insights?
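Purely as an illustration of the syntax (the cutoffs here are made up; one rough heuristic is to set the max to about twice the expected mean coverage, so ~160 for your ~80x pools, to catch likely duplications - but whether that fits your data is exactly the "it depends" part):

grenedalf diversity --sync-path pops.idf.tef.sync \
--filter-sample-min-count 2 \
--filter-sample-min-coverage 12 \
--filter-sample-max-coverage 160 \
--pool-sizes 40 \
--window-type chromosomes \
--measure all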

PS: Is there somewhere in the documentation that defines the output columns of diversity --measure all, specifically the 'abs' vs 'rel' distinction?

Working on it. I'm currently re-working a lot of the code surrounding the whole filtering, and there will be changes in the next release version, and hopefully more documentation as well!

Cheers and so long Lucas

lczech commented 5 months ago

Hi @r-mashoodh,

I finally finished the re-design of the filtering and all that - see the latest release, grenedalf v0.5.0. The wiki now also contains more details on all the steps, and on the output columns, which have been re-designed and re-named as well; see here.

With that, it seems the issue can be closed. If not, feel free to re-open or start a new one if you have more questions!

Cheers and so long Lucas

r-mashoodh commented 3 months ago

thanks for all the changes and the associated documentation. I had one silly question, and I wonder if I have just missed this: what exactly does --pool-sizes refer to? Is it the number of alleles or the number of individuals?

I assume it is similar to PoPoolation, where with 40 diploid individuals the pool size would be 80?

lczech commented 3 months ago

Hey @r-mashoodh,

that's not a silly question at all - and thanks to you, I just noticed that I should probably document this in a bit more detail :-)

The pool sizes are indeed the number of haploids, so yes, if you have 40 diploid individuals, your pool size would be 80.

I'll add a bit to the documentation about that, thanks for pointing that out! Lucas

r-mashoodh commented 3 months ago

Thanks, @lczech!

Another question -- I noticed that the subsampling isn't implemented in the FST calculations; is there any particular reason why?

lczech commented 3 months ago

I don't think it's needed there. It's mostly there to avoid overly long computation times for the diversity metrics, which have a term in them that depends on read depth. It should still be fast enough for most "normal" read depths, but if, for instance, you are merging samples via the grouping option, read depths can get out of hand, leading to overly long computations. The subsampling option helps to reduce the time there.

For the FST computation, the equations have no such dependency on read depth, so we can just use the full data that is there. However, if you have a use case where you want to subsample, let me know - it's easy to add if needed.

r-mashoodh commented 3 months ago

I don't think I need it, but it's good to understand. Thanks for the explanation!

lczech commented 3 months ago

Sure :-)

I wasn't quite sure if I should add it to FST as well - it just did not seem like it would do anything useful, and would instead only reduce the amount of usable information. But if you end up needing this, let me know!

r-mashoodh commented 3 months ago

Will do thank you @lczech!

Can I ask another question -- do you think there is a reasonable way to estimate Ne from pool-seq data? We know that in theory Ne = pi / (4 mu), where mu is the mutation rate. Could that work, or is there a better way to get this info?
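To spell out what I mean with made-up numbers: under neutrality, $\pi$ estimates $\theta = 4 N_e \mu$, so for example

$$N_e = \frac{\pi}{4\mu} = \frac{0.004}{4 \times 10^{-8}} = 10^5.$$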

lczech commented 3 months ago

Phew, I'm not a pop gen statistician by training, so I'm always careful with these things... @jeffspence would be a better person to ask that.