lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
33 stars 2 forks source link

Question: How does the SNP window work? #10

Closed SheepwormJM closed 3 months ago

SheepwormJM commented 1 year ago

Hi,

I've been plotting files and realised that the SNP windows are the same co-ordinates for all my pairwise comparisons. But that some have 'nan'. Does this mean that windows are determined as those which hold 100 SNPs across all samples (I asked for 100 SNPs), but when two samples being compared have fewer than 100 SNPs in a given window, the statistic for that window for that comparison is not calculated?

Thanks, Jenni

lczech commented 1 year ago

Hi Jenni,

what exact command and window type are you referring to? It sounds like --window-type queue? The way that this is currently implemented is unfortunately not yet fully documented, and also probably neither intuitive nor the best way to solve this... It behaves differently for different commands, as a necessity of how those commands compute their values. For instance, diversity metrics need all positions (not only the SNPs) to get the correct statistical values, while FST can only be computed from SNPs. Hence, the queue windowing type is doing different things in these two commands. I am not really happy with that, but have not yet properly figured out what the "correct" way for this would be. Maybe @jeffspence has an insight to offer here?

Hope that helps for now, and happy to get suggestions on what you would that window type expect to do... and please excuse the current lack of documentation there - working on it! In the meantime, it might be easier to use windows that are fixed along the genome (sliding) instead.

Cheers Lucas

SheepwormJM commented 1 year ago

Hi Lucas,

Thanks. It's the queue window, yes.

This was my command:

grenedalf fst --sync-path ~/input.java.sync --reference-genome-fasta-file ~/genomic.fa --sample-name-list ~/Expt_3_sample_names.txt --filter-sample-min-count 2 --filter-sample-min-coverage 100 --filter-sample-max-coverage 100 --window-type queue --window-queue-count 100 --window-queue-stride 100  --method unbiased-nei --pool-sizes 400 --omit-na-windows --comparand-list ~/Expt_3_pairwise_comparisons.txt --separator-char tab --out-dir ~/grenedalf --file-prefix Expt3_subsampled100_100SNPwin_fst --log-file grenedalf_Expt3_subsampled100_100SNPwin_fst.log 2>error_log_100SNP_fst &

It really helps with resolution, and most windows say they have 100 SNPs in them, but there is at least one each time that has fewer (summary in R - I've assumed it could have been the last window on the chromosome, but have never checked, sorry). Some pairwise comparisons have nan.

For example:

chr4    1141327 1142722 100     -0.00695203     0.000231965     0.00778936      -0.00604267     -0.00779172     0.00130702      0.00371063      -0.00450315     0.00173352      0.146387        0.240648  -0.00491509     0.00638189      0.00673182      0.0243291       0.0179133       0.0189788       4.22884e-05     0.0154784       0.0376052       0.00417511      -0.000195074    0.113194        -0.00165458     0.0227293
        0.0392787       0.00784316      -0.000459629    0.0131253       -0.003023       0.143592        0.0344057       0.00638734      0.0889148       0.0185885       1.36238e-05     0.136934        0.243194        0.20307 0.255121  0.250249        0.187066        0.0252403
chr4    1142726 1144312 100     0.0107968       nan     nan     nan     nan     0.0107968       0.0102428       0.0116685       nan     0.0902655       0.163268        nan     nan     nan     nan       0.0107968       0.0102428       0.0107968       nan     0.024114        0.00936427      nan     0.0231394       0.0169306       0.0107968       0.0176021       0.0116685       nan     0.00936427      nan     0.0235505         0.0169306       nan     0.0225644       nan     nan     0.128927        0.164123        0.16212 0.141333        0.160603        0.162222        0.0140421
chr4    1144320 1145862 100     -0.00587571     -0.0050662      0.00193706      -0.00541077     -0.00580294     0.000374341     -0.00211828     -0.00525584     -0.00478518     0.0693279       0.0701942 -0.00477734     0.00980586      0.0356659       0.00407541      0.0168834       0.0189759       0.0272585       0.0069936       0.00855553      0.000326853     0.00155089      0.0159223       -0.00173315     0.0353451         0.00824619      0.000289347     -0.00016709     -0.00157581     0.0312525       0.0167464       0.0121889       0.00478186      0.0278584       0.0218964       0.00117005      0.0513719       0.071085        0.0831691         0.0730253       0.0781615       0.093269        0.00951752

So, the window co-ordinates are the same for all pairwise comparisons. But I find it odd that they all have 100 SNPs in them so regularly. I have fewer than 1500 windows with NAN in them across the entire genome, but I have over 100k windows. Some pairwise comparisons have <10 or zero windows without an Fst value computed.

The resolution seems to improve with this method (over 10 kb sliding windows), but would be good to know what it is doing. I'm was using it with the idea that having the same number of SNPs contributing to each Fst value would reduce/counteract the variance seen from noise.

Does that make sense? All the best, Jenni

lczech commented 1 year ago

Hi Jenni,

thanks for your patience with my reply! I think this might be a misunderstanding of how the queue window type works. Let me maybe rephrase what that window type does in more detail, to make sure we are on the same page.

With that in mind, now some answers to your points that hopefully make sense now:

I've assumed it could have been the last window on the chromosome, but have never checked, sorry

Yes, that should be the case - all windows have 100, because that's what that type of window does, until it runs out of data, which will be at the end. The difference is the width of each window - it will span however many positions are needed to find 100 non-filtered SNPs.

By the way, this is also what I was referring to earlier when I said that the behavior differs for different commands. The diversity command needs all positions (not just SNPs), and hence does not apply the additional SNP filter (unless it's explicitly specified). I am not really happy with that implicit difference in commands, and will probably fix that... But for now, that's how it works.

Some pairwise comparisons have nan.

Yep, that would happen if some samples have no data at a given interval. Only if all pairwise fst for a given window are nan we will omit them due to --omit-na-windows, but as long as at least one of them has a valid fst value, the window will be reported.

So, the window co-ordinates are the same for all pairwise comparisons.

Yes, if you want to compute fst in a way where the 100 SNPs that are used are pairwise between samples, instead of total across all samples, you will have to run grenedalf individually for each pair, instead of running it once with all samples. However, that would lead to the queue windows having different start and end positions for each pair of samples, depending on where these samples have coverage and SNPs and all that, and hence make downstream comparisons between samples more difficult. The way it's implemented now is meant to make it easy to compare across samples, as the windows are the same across all pairs of samples.

But I find it odd that they all have 100 SNPs in them so regularly.

Yes, hope that makes sense now with the above explanation. Basically, window widths (start and end positions) are so that they contain exactly 100 SNPs. If you want fixed sized windows in terms of equal length along the genome instead, use the sliding window type.

Hope that makes sense now. If not, please let me know. Also, if you think that this is not the best way to do this - I'm open to suggestions on how to improve this. As I said, I'm not super happy that the behavior of the queue window type depends on the command, and will work on that.

Cheers and so long Lucas

SheepwormJM commented 1 year ago

Hi Lucas,

Thanks for this explanation! I was chatting with someone about it yesterday, and he had pointed out that as I was using a sync file, which I had also subsampled, all samples would have had coverage at all positions included. So this might have made it more likely that all pairwise comparisons would have had the same SNP positions/same sites.

I love your explanation above - that's really clear.

We assume some input chromosome, and some input samples. If not all positions are present (e.g., due to low coverage), that does not matter - whatever is there is propagated as-is. That data is now read position by position, and all your provided filters are applied to it. After that, additionally, the fst command filters out everything that isn't a SNP (across all input samples - if the total has more than one non-zero allele count at that position [after all the filters], the position is considered to be a SNP). What remains is a list of positions that passed all filters and are SNPs. Some samples might not have data in that window, but that's okay - as long as some samples overall contain data, that position will be taken into account. With --window-type queue and --window-queue-count 100, this list is now "queued" (i.e., chunked or split) into windows of 100 positions exactly, and the statistic is computed on each window. The output that you provided contains start and end such as 1141327 1142722, meaning that that window spans 1395 (difference between the two) positions in your input. These positions contain a total of 100 "interesting" positions (after all the filtering).

I guess my one outstanding question is just this: as I've run all my populations together, but not all might have 100 SNPs (maybe they have a fixed allele at a handful of positions, so only have, say 94 SNPs in the window), do they then have NaN, or is an Fst value still computed? Same question I suppose if I had used, say, bams as input and had some without coverage at a site within a queued window (so not a subsampled sync).

I am mainly trying to figure out if all my Fst values calculated for all pairwise comparisons for a given window would have equal weight - i.e. all have had 100 SNPs contribute, or are ignored. I think that my main idea that they might have an Fst calculated with fewer than this (so pwc1 = 100 SNPs contribute, pwc2=95 SNPs contribute, pwc3=72 SNPs contribute etc) is because there is that last window with the leftover SNPs (e.g. 44 rather than 100 positions).

Or does it work like this:

  1. If 100 SNPs, then 100 SNPs contribute to Fst
  2. If <100 SNPs, but all have coverage, then those which are fixed give a value of 0, but are included in the Fst, so still equal number of sites (that are SNPs in at least one sample) contribute.
  3. If < 100 sites have coverage window is excluded for that pwc by nan.

Sorry if I'm making this more complicated than it is. It's clearly improved the noise in my data, just trying to figure out how consistently it works.

Thanks,

Jenni

lczech commented 11 months ago

Hi Jenni,

Thanks for this explanation! I was chatting with someone about it yesterday, and he had pointed out that as I was using a sync file, which I had also subsampled, all samples would have had coverage at all positions included. So this might have made it more likely that all pairwise comparisons would have had the same SNP positions/same sites.

Ah I see, that makes sense. Which tool did you use for the subsetting?

I guess my one outstanding question is just this: as I've run all my populations together, but not all might have 100 SNPs (maybe they have a fixed allele at a handful of positions, so only have, say 94 SNPs in the window), do they then have NaN, or is an Fst value still computed? Same question I suppose if I had used, say, bams as input and had some without coverage at a site within a queued window (so not a subsampled sync).

Good question. The windows are assembled taking all samples into account, as explained above. That is, each position that passed all filters is taken as a position for the window. If any particular sample does not have any counts there, it's just included as 0s, as that doesn't make a difference for FST (no need to mask it or anything like that). Then, FST is computed for each pair of samples individually. So, say, in a given window a sample only has 94 non-zero positions, only those are really used in the FST computation. Well - it's implemented in a way where everything that yields an invalid result is just skipped - and computing FST of all zeros has a division by 0, so it's skipped. So yes, per position and pair of samples, you'd get NaNs for "missing" data, but those are just skipped then. Same for bam files with no coverage - just more 0s.

That being said, it does make a difference for the diversity statistics, as those normalize by number of positions, for which we might want to differentiate between missing data, invariant sites, etc. That is currently not implemented though, see here as well.

I am mainly trying to figure out if all my Fst values calculated for all pairwise comparisons for a given window would have equal weight - i.e. all have had 100 SNPs contribute, or are ignored. I think that my main idea that they might have an Fst calculated with fewer than this (so pwc1 = 100 SNPs contribute, pwc2=95 SNPs contribute, pwc3=72 SNPs contribute etc) is because there is that last window with the leftover SNPs (e.g. 44 rather than 100 positions).

Hm well, the way it sounds, not all of them would have all positions in there. But with that windowing approach, that is tricky. I could implement something that outputs the number of non-zero positions in each sample as an additional output to the fst command. As of now, you can get those using the frenquency command by using --write-sample-coverage to get the coverage per sample, and then write some script to bring that information back into your windows....

If you were instead using fixed sized windows ("sliding"), that would be easier, as you could run the command for each pair of samples individually, and it would report the number of positions in each window that were used. With the "queue" windowing, that won't work, as you'd get non-synchronized windows, where each pair of samples would determine their own set of 100 SNPs, making downstream comparisons very hard...

Also, as for the BAM file thing, maybe have a look at the second half of my answer here as well.

Hope that helps, and let me know about any further questions!

Cheers Lucas

lczech commented 3 months ago

Hi @SheepwormJM,

finally, an update on this! The latest release grenedalf v0.5.0 now has a completely re-designed approach to the filtering, which also makes the queue window approach work in a more intuitive way, see here. The release also implements a lot of other improvements around the whole filtering, windowing, and window averaging under missing data conditions.

I think this finally answers your original question in a more reasonable way. Hence, closing the issue now, but feel free to re-open or start a new one if you have further questions!

Cheers and so long Lucas

SheepwormJM commented 3 months ago

Thanks so much Lucas! I'll take a look at this! 🙂


Please note that I don’t expect you to read or respond to this email outside your normal working hours, or when on holiday. I will not normally check emails outside of working hours.

I currently work Monday to Friday and will aim to respond to you as soon as I can when I am in the office.

Jennifer McIntyre

Rm 242, Henry Wellcome Building

University of Glasgow

Garscube Estate

G61 1QH

Tel: 0141 330 8216 / 07791 644 116

Github: https://github.com/SheepwormJM


From: Lucas Czech @.> Sent: 08 June 2024 14:28 To: lczech/grenedalf @.> Cc: Jennifer McIntyre @.>; Mention @.> Subject: Re: [lczech/grenedalf] Question: How does the SNP window work? (Issue #10)

Hi @SheepwormJMhttps://github.com/SheepwormJM,

finally, an update on this! The latest release grenedalf v0.5.0https://github.com/lczech/grenedalf/releases/tag/v0.5.0 now has a completely re-designed approach to the filtering, which also makes the queue window approach work in a more intuitive way, see herehttps://github.com/lczech/grenedalf/wiki/Windowing. The release also implements a lot of other improvements around the whole filtering, windowing, and window averaging under missing data conditions.

I think this finally answers your original question in a more reasonable way. Hence, closing the issue now, but feel free to re-open or start a new one if you have further questions!

Cheers and so long Lucas

— Reply to this email directly, view it on GitHubhttps://github.com/lczech/grenedalf/issues/10#issuecomment-2156037411, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANH2WOQWZGUBGCALDP4LU6LZGMBHRAVCNFSM6AAAAABI773DACVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJWGAZTONBRGE. You are receiving this because you were mentioned.Message ID: @.***>