ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
117 stars 14 forks source link

--sites_file argument ouputs only NAs in avg_pi column #55

Closed tshalev closed 2 years ago

tshalev commented 2 years ago

I decided to try out the --sites_file argument for some zero-fold and four-fold degenerate sites. I ran pixy v1.2.5b with the following arguments: pixy --stats pi --vcf allsites.vcf.gz --sites_file zero_fold sites.txt --bed_file primary_genes.bed --populations popfile.txt

My sites file looks like this: 28166317 1056 28231171 1460 28317569 5681 29126773 4129 29253970 2056 29334608 228057 29334608 247168 29334608 276102 29334608 276433 29334608 276540 29334608 278115

(first few lines), is tab delimited and formatted exactly the same as the CHROM and POS columns in the VCF. The run goes to completion, but the output looks like this:

pop chromosome window_pos_1 window_pos_2 avg_pi no_sites count_diffs count_comparisons count_missing diverse 29378234 211174 236465 NA 0 NA NA NA diverse 29378234 313635 321061 NA 0 0 0 75040 diverse 29378234 663334 704075 NA 0 0 0 63168 diverse 29378234 870939 928737 NA 0 0 0 75488 diverse 29378234 967457 968068 NA 0 NA NA NA diverse 29377609 296069 321271 NA 0 0 0 21504 diverse 29377609 1206104 1360768 NA 0 NA NA NA diverse 29377609 1725734 1864601 NA 0 NA NA NA diverse 29377609 2508452 2582283 NA 0 0 0 98784 diverse 29377609 3516420 3581643 NA 0 NA NA NA

So, some of these sites are not in a window, and we can see that in the lines where everything is NA. But the ones that show just missing should have some outcome. I know for certain that there are SNPs in those windows, and that there is at least some non-zero pi. For certain the data isn't all missing.

Does pixy have issues running --sites_file with nonstandard chromosome names?

Thanks, Tal

ksamuk commented 2 years ago

Hi Tal, hope all is well at UBC. Hard to know exactly what's going on here without looking at all the data together. Can you send me (ksamuk@gmail.com) the following?

tshalev commented 2 years ago

Hi Kieran,

Thanks. I've sent it to you now.

ksamuk commented 2 years ago

Hi Tal, thanks for sending along your data. I tracked down the issue. Basically, it was an edge case where if both a sites file and a bed file and purely numeric chromosome IDs were used, a window calculation would fail (silently, fun stuff).

This is now fixed in pixy 1.2.6.beta1, which should build on conda-forge overnight. Have a look tomorrow to see if it's up here: https://anaconda.org/conda-forge/pixy/files (should see 1.2.6.beta1), update pixy, and try re-running your analysis.

tshalev commented 2 years ago

Thanks Kieran! I will check it out tomorrow, and closer the issue if all goes well.

-Tal

On Feb. 9, 2022 22:16, Kieran Samuk @.***> wrote: [CAUTION: Non-UBC Email]

Hi Tal, thanks for sending along your data. I tracked down the issue. Basically, it was an edge case where if both a sites file and a bed file and purely numeric chromosome IDs were used, a window calculation would fail (silently, fun stuff).

This is now fixed in pixy 1.2.6.beta1, which should build on conda-forge overnight. Have a look tomorrow to see if it's up here: https://anaconda.org/conda-forge/pixy/files (should see 1.2.6.beta1), update pixy, and try re-running your analysis.

— Reply to this email directly, view it on GitHubhttps://github.com/ksamuk/pixy/issues/55#issuecomment-1034533029, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACDUZ7WHDGEJM3XBYHUCZ5DU2NJBJANCNFSM5NN6M6LQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.Message ID: @.***>

tshalev commented 2 years ago

So, it does run now and I get results. Out of curiosity, how does this approach differ from using a VCF file filtered to only include these sites (and invariant sites)? Because, using this approach the estimates of diversity are several orders of magnitude larger for each site. For example:

Using the --sites option: diverse 29377047 1544252 1625615 0.50208199871877 1 12540 24976 120288

Using a filtered VCF file: diverse 29377047 1544252 1625615 0.0009535954159284349 537 12540 13150231 1178

It seems that the count comparisons is much larger in the latter. Am I utilising the sites option correctly? It just seems strange that pi for my entire data set is around 0.003 and pi at 4-fold/0-fold sites is around 0.3.

ksamuk commented 2 years ago

Hi Tal, The way things are set up currently is that the sites file specifies the sites – variant and invariant – where pi will be calculated. So if you only include 4-fold sites (many of which will be SNPs and not invariant sites), pi will naturally be much higher.

I think its generally conventional to include the invariant sites when calculating 4-fold pi, but I've seen it done with just the sites themselves as well. Just depends on what you want to do.

tshalev commented 2 years ago

Thanks Kieran; I had thought that it was already including the invariant sites. The results make more sense now.

From: Kieran Samuk @.> Sent: February 11, 2022 3:13 PM To: ksamuk/pixy @.> Cc: Shalev, Tal @.>; Author @.> Subject: Re: [ksamuk/pixy] --sites_file argument ouputs only NAs in avg_pi column (Issue #55)

[CAUTION: Non-UBC Email]

Hi Tal, The way things are set up current is that the sites file specifies the sites – variant and invariant – where pi will be calculated. So if you only include 4-fold sites (many of which will be SNPs and not invariant sites), pi will naturally be much higher.

I think its generally conventional to include the invariant sites when calculating 4-fold pi, but I've seen it done with just the sites themselves as well. Just depends on what you want to do.

— Reply to this email directly, view it on GitHubhttps://github.com/ksamuk/pixy/issues/55#issuecomment-1036739042, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACDUZ7SKUXU7Z4XL227BBP3U2WJXJANCNFSM5NN6M6LQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you authored the thread.Message ID: @.**@.>>

tshalev commented 2 years ago

Sorry to reopen this issue; but regarding your previous comment, considering that some invariant sites are naturally either 4-fold or 0-fold degenerate sites, would it make sense to only use, for example, 4-fold invariant and invariant sites together, or should all invariant sites irrespective of their potential codon impact be used? I initially thought the latter made more sense, but I have now seen some studies that appear to take the former approach.

ksamuk commented 2 years ago

Hi Tal, this seems like a decision you'll need to make about your analysis based on your goals/your reading of the literature -- pixy can accommodate either one!

tshalev commented 2 years ago

No problem - thanks again.