hardingnj / xpclr

Code to compute the XP-CLR statistic to infer natural selection
MIT License
85 stars 26 forks source link

polyploid vcf #72

Open janet-higgins13 opened 2 years ago

janet-higgins13 commented 2 years ago

I am running XP-CLR using a vcf output for ploidy3 so the genotype is shown as follows. 0/0/1:16,4:20:36:110,0,36,673

How does XP-CLR interpret triploid calls? I am just checking whether it takes account of the third call or just ignores it and assumes the vcf is diploid?

Obviously, I would like to run it using the triploid vcf if this is valid

I have transformed my vcf to diploid by changing 0/1/1 and 0/0/1 to 0/1

Comparing the output from both runs, I am loosing a lot of SNPs from the triploid run

diploid 2021-09-14 09:33:09 : INFO : running xpclr v1.0.0 2021-09-14 09:33:09 : INFO : Loading VCF 2021-09-14 09:33:23 : INFO : VCF loading complete 2021-09-14 09:33:23 : INFO : 15,265 SNPs in total are in the provided input files 2021-09-14 09:33:23 : INFO : 0 SNPs excluded as multiallelic 2021-09-14 09:33:23 : INFO : 10 SNPs excluded as missing in all samples in a population 2021-09-14 09:33:23 : INFO : 8,243 SNPs excluded as invariant or singleton in population 2 2021-09-14 09:33:23 : INFO : 7,012/15,265 SNPs included in the analysis (45.94%) 2021-09-14 09:33:23 : INFO : Done dropping above SNPs from analysis. XP-CLR algorithm starting. 2021-09-14 09:33:23 : INFO : No genetic distance provided; using rrate of 1e-08/bp 2021-09-14 09:33:23 : INFO : Omega estimated as : 0.362105

triploid 2021-09-14 09:33:08 : INFO : running xpclr v1.0.0 2021-09-14 09:33:08 : INFO : Loading VCF 2021-09-14 09:33:26 : INFO : VCF loading complete 2021-09-14 09:33:26 : INFO : 24,606 SNPs in total are in the provided input files 2021-09-14 09:33:26 : INFO : 0 SNPs excluded as multiallelic 2021-09-14 09:33:26 : INFO : 0 SNPs excluded as missing in all samples in a population 2021-09-14 09:33:26 : INFO : 22,670 SNPs excluded as invariant or singleton in population 2 2021-09-14 09:33:26 : INFO : 1,936/24,606 SNPs included in the analysis (7.87%) 2021-09-14 09:33:26 : INFO : Done dropping above SNPs from analysis. XP-CLR algorithm starting. 2021-09-14 09:33:26 : INFO : No genetic distance provided; using rrate of 1e-08/bp 2021-09-14 09:33:26 : INFO : Omega estimated as : 0.609867

Would you recommend using the diploid or triploid vcf?

hardingnj commented 2 years ago

Hi Janet.

Interesting problem. As XPCLR only requires population allele counts, the triploid should work fine. VCFs are read in using scikit- allel which supports multiple ploidy.

I'm also not sure what populations exist in a triploid state, but I guess that shouldn't matter too much.

However, it seems that this isn't working properly, as so many of the sites in population 2 are fixed. Would you mind sharing an excerpt of the VCFs?

If not I would recommend using the scikit alele read_vcf function and seeing if it is able to interpret the ploidy in your VCF.

janet-higgins13 commented 2 years ago

Hi Nick

I have had a look at my vcf using scikit-allel 1.2.1 and it only shows diploid calls

callset = allel.read_vcf('example.vcf') gt = allel.GenotypeArray(callset['calldata/GT'])

  0 1 2 3 4 ... 185 186 187 188 189
0/0 0/0 0/0 0/0 0/0 ... 0/0 0/0 0/0 ./. ./.
0/0 0/0 0/0 0/0 0/0 ... 0/0 0/0 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0 ... 0/0 0/0 0/0 ./.  

example.vcf.txt

I have attached a sample of my vcf

Thanks Janet

hardingnj commented 2 years ago

Thanks Janet- that's interesting I would have assumed it would be read at ploidy=3. As a hack, as it only counts the alternative aleles, as long as the ratios are correct in both populations it should work fine as diploid. So you could create "fake" individuals

ie 0/0/1 and 0/1/1 would be the same as 3 diploid individuals of 0/0, 0/1 and 1/1.