hardingnj / xpclr

Code to compute the XP-CLR statistic to infer natural selection
MIT License
85 stars 26 forks source link

No sel_coef #59

Closed PhDMattyB closed 3 years ago

PhDMattyB commented 3 years ago

Hi @hardingnj, I just have a quick question about the output of xpclr. Whenever I run xpclr I'm not getting a lot of the columns in the output file, mainly the selection coefficient. Xpclr seems to be outputting everything up until the nullL column and then dropping everything else. I just wanted to check with you to see if this is normal or not.

input code: xpclr --input popn_chr1_phased.vcf --samplesA popn_benthic.txt --samplesB popn_pelagic.txt --chr 1 --minsnps 2 --maxsnps 100 --size 200000 --step 1000 --phased --ld 0.95 --rrate 0.00000008 --out ./popn_sweep_chr1

Any advice would be greatly appreciated!

Matt

hardingnj commented 3 years ago

Hmm- seems like SNPs are being skipped- but your min SNPs seems very low. Can you share the std out, and maybe a couple of lines from your output file?

PhDMattyB commented 3 years ago

Here is the first couple of the lines from the output file:

id chrom start stop pos_start pos_stop modelL nullL sel_coef nSNPs nSNPs_avail xpclr xpclr_norm 1_00000001_00200000 1 1 200000 0 0 0.0 0.0 1_00000101_00200100 1 101 200100 0 0 0.0 0.0 1_00000201_00200200 1 201 200200 0 0 0.0 0.0 1_00000301_00200300 1 301 200300 0 0 0.0 0.0

Here is what xpclr outputs when its running: 2020-07-27 09:35:04: INFO: sunning xpclr v 1.1.2 2020-07-27 09:35:04: INFO: loading VCF 2020-07-27 09:35:04: INFO: VCF loading complete 2020-07-27 09:35:04: INFO: 393 SNPs in total are in the provided input files 2020-07-27 09:35:04: INFO: 0 SNPs excluded as multiallelic 2020-07-27 09:35:04: INFO: 0 SNPs excluded as missing in all samples in a population 2020-07-27 09:35:04: INFO: 378 SNPs excluded as invariant or singleton in population 2 2020-07-27 09:35:04: INFO: 15/393 SNPs included in the analysis (3.82%) 2020-07-27 09:35:04: INFO: Done dropping SNPs from the analysis. XP-CLR algorithm starting 2020-07-27 09:35:04: INFO: No genetic distance provided; using rrate of 1e-08/bp 2020-07-27 09:35:04: INFO: Omega estimated as: 0.834940 2020-07-27 09:36:18: INFO: Analysis complete. Output file ./GSBPI_sweep_chr1_test

This is from the analysis for the first chromosome. The amount of SNPs being excluded is a bit worrying. The vcf files for the phased data is straight from SHAPEIT and I'm wondering if something funky happened when it tried to impute the phased SNPs.

hardingnj commented 3 years ago

Yes- that's the issue I think, lack of SNPs.

What you need from population 2 is a good estimate of true ancestral allele frequencies. This allows the model to identify large deviations in pop 1 from the "original" frequency. ie fast drift. When this happens several times in the same window- that's good evidence for selection.

If you have few samples in pop 2, you don't get a good estimate of ancestral AFs, and your deviations are very stochastic.

Here it seems like the majority of your SNPs are fixed in population 2, which gives us no (useful) information about the true ancestral allele frequencies. The only way you can get from 0.0 or 1.0 to an AF between is via a mutation, not via drift. Therefore they are of no value to XPCLR.

You probably need larger samples, or alternatively to infer selection perhaps look at something like PBS, which requires an outgroup.

PhDMattyB commented 3 years ago

That makes total sense! Thanks a ton for all of your help.