Possible incorrect labeling of SNPs

Hello, and thanks for making the PWCoCo resource available!

I've noticed something that I think may be a bug related to swapped SNP labels. Here is a toy example demonstrating the issue. It only runs PWCoCo on 3 SNPs, which we would obviously never do, but it recreates the problem I've observed in real runs of hundreds of variants. Coordinates are hg19:

Exposure:

1:27220521:T:G G T 0.1 -0.030001 0.003 0.0000000000000000000000001
1:27221963:T:C C T 0.1 -0.030002 0.003 0.0000000000000000000000001
1:27304568:A:G G A 0.7 0.015 0.002 0.00000000001

Outcome:

1:27220521:T:G G T 0.15 -0.021 0.01 0.05
1:27221963:T:C C T 0.15 -0.020 0.01 0.05
1:27304568:A:G G A 0.75 0.01 0.01 0.2

An excerpt from the log output is as follows:

[exposure] Total amount of SNPs matched from phenotype file with reference SNPs are: 3.
[exposure] Performing stepwise model selection on 3 SNPs; p cutoff = 5e-08, collinearity = 0.9 assuming complete LE between SNPs more than 10.0 Mb away).
[exposure] Selected SNP 1:27221963:T:C with chisq 100.01 and pval 1.51e-23.
[exposure] Selected entry SNP 1:27221963:T:C with cpval 9.22e-17.
[exposure] Selected entry SNP 1:27220521:T:G with cpval 2.00e+00.
[exposure] 1:27220521:T:G does not meet threshold

You can see that 1:27221963:T:C is selected as the first signal, which makes sense. But then it is also listed as the next 'Selected entry SNP' which doesn't make sense.

The .coloc and .cojo output files (below) indicate that the 2 conditionally independent SNPs selected by COJO are 1:27221963:T:C and 1:27304568:A:G. This makes sense, given that these 2 SNPs are not in LD (r2=0.007), and both have significant unconditioned p-values (the 3rd inputted SNP, 1:27220521:T:G, has near perfect LD with 1:27221963:T:C, so it makes sense it wouldn't be one of the independent signals).

.coloc file

Dataset1    Dataset2    SNP1    SNP2    nsnps   H0  H1  H2  H3  H4  log_abf_all
exposure    outcome unconditioned   unconditioned   3   0.99937 0.000299811 0.000299811 5.99622e-08 2.99811e-05 0.000629862
exposure    outcome 1:27221963:T:C  unconditioned   2   9.73607e-20 0.909008    1.94721e-23 9.09008e-05 0.0909008   43.7759
exposure    outcome 1:27304568:A:G  unconditioned   1   6.16114e-10 0.909091    6.16114e-14 0   0.0909091   21.2076

out.exposure.1:27221963:T:C.cojo

Chr SNP bp  refA    freq    b   se  p   n   freq_geno   bC  bC_se   pC  1:27221963:T:C  1:27304568:A:G
1   1:27220521:T:G  27220521    G   0.1 -0.030001   0.003   1.51885e-23 617185  0.1504  -0.0318381  0.00300024  2.62499e-26 0.997871    0.00691254
1   1:27221963:T:C  27221963    C   0.1 -0.030002   0.003   1.51375e-23 617185  0.150307    -0.0318353  0.00300024  2.65143e-26 1   0.00688391

Based on this toy example, as well as occasional full scale analyses, I'm wondering if some sort of SNP label swapping is occurring (probably in the log file). In a larger-scale run (>100 variants), I have seen an example where the 2nd 'Selected entry SNP' (SNP2) is reported in the log file as significant after conditioning on the first selected SNP (SNP1), but then in the corresponding .cojo file that has conditioned only on the first selected SNP1, the pC for SNP2 is >0.05 (i.e., not significant), and a different SNP altogether (SNP3) has the smallest pC which is actually equal to the cpval shown in the log file for SNP2! Hence my thinking that perhaps SNP labels have been swapped somehow.

That is my main concern. Another thing that is pretty minor (I think) is that in the toy example above, the .cojo file results show that the pC value for 1:27220521:T:G is actually slightly smaller than the pC for 1:27221963:T:C, and yet the latter is selected as the lead SNP for that independent signal. I'm assuming this is just a rounding matter? These pC values are clearly not meaningfully different, I just want to be sure nothing unexpected is going on here.

Thank you!

jwr-git / pwcoco

Possible incorrect labeling of SNPs #12