Closed nickhir closed 6 months ago
I agree the cojo output seems to be buggy.
Here is a minimal example:
assoc1:
1:27052080:G:A A G 0.1 0.001 0.001 0.01
1:27220521:T:G G T 0.01 0.01 0.001 0.001
assoc2:
1:27052080:G:A A G 0.1 0.001 0.01 0.1
1:27220521:T:G G T 0.1 0.01 0.01 0.01
command:
pwcoco \
--bfile ./plinkout \
--sum_stats1 ./assoc1 \
--sum_stats2 ./assoc2 \
--pve1 1 \
--pve2 1 \
--out ./output/out \
--log ./output/log \
--out_cond
log (excerpt):
[assoc1] Selected SNP 1:27220521:T:G with chisq 100.00 and pval 1.52e-23.
[assoc1] Selected entry SNP 1:27052080:G:A with cpval 9.30e-02.
[assoc1] 1:27052080:G:A does not meet threshold
[assoc1] Finally, 1 associated SNPs have been selected.
[assoc2] Total amount of SNPs matched from phenotype file with reference SNPs are: 2.
[assoc2] Performing stepwise model selection on 2 SNPs; p cutoff = 5e-08, collinearity = 0.9 assuming complete LE between SNPs more than 10.0 Mb away).
[assoc2] Selected SNP 1:27220521:T:G with chisq 1.00 and pval 3.17e-01.
[assoc2] SNP did not meet threshold.
[assoc2] No SNPs have been selected by the step-wise selection algorithm. Using the unconditioned dataset.
There are 1 selected SNPs in the exposure dataset and 0 in the outcome dataset.
Performing 1 conditional and colocalisation analyses.
Colocalisation analysis initialised with 1 SNPs.
Conditioned results for SNP1: 1:27220521:T:G*, SNP2: unconditioned
coloc result:
Dataset1 Dataset2 SNP1 SNP2 nsnps H0 H1 H2 H3 H4 log_abf_all
assoc1 assoc2 unconditioned unconditioned 2 0.99958 0.000199916 0.000199916 1.99916e-08 1.99916e-05 0.000419932
assoc1 assoc2 1:27220521:T:G* unconditioned 1 0.999898 1.98952e-06 9.99898e-05 0 1.98952e-07 0.000102183
cojo files:
out.assoc1.1:27052080:G:A.cojo
Obviously, I expected the other variant as cojo output! Now the most important question: Is only the incorrect cojo file written to disk, or did PWCoCo condition on the wrong SNP, invalidating the results?
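For readers following the log excerpt, the chisq and pval values can be reproduced by hand. This is a minimal Python sketch, assuming the four numeric columns of the example files are allele frequency, effect size (beta), standard error and p-value, in that order. The thread does not state the column layout; it is inferred here because it reproduces the logged values. The sketch also assumes the reported p-value is a two-sided 1-df Wald test on beta/se. The function name is hypothetical and is not part of PWCoCo.

```python
import math

P_CUTOFF = 5e-8  # "p cutoff = 5e-08" from the log excerpt


def wald_chisq_and_p(beta, se):
    """Return (chisq, p) for a 1-df Wald test of beta against zero."""
    z = beta / se
    chisq = z * z
    # Two-sided normal tail probability, identical to the 1-df chi-square tail.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return chisq, p


# (beta, se) taken from the minimal example, under the ASSUMED column order
# SNP A1 A2 freq beta se p.
example = {
    ("assoc1", "1:27052080:G:A"): (0.001, 0.001),
    ("assoc1", "1:27220521:T:G"): (0.01, 0.001),
    ("assoc2", "1:27052080:G:A"): (0.001, 0.01),
    ("assoc2", "1:27220521:T:G"): (0.01, 0.01),
}

for (dataset, snp), (beta, se) in example.items():
    chisq, p = wald_chisq_and_p(beta, se)
    status = "meets" if p < P_CUTOFF else "does not meet"
    print(f"[{dataset}] {snp}: chisq {chisq:.2f}, pval {p:.2e} ({status} cutoff)")
```

Under these assumptions, 1:27220521:T:G gets chisq 100.00 with pval 1.52e-23 in assoc1 and chisq 1.00 with pval 3.17e-01 in assoc2, matching the log. This is consistent with one SNP being selected in assoc1 and none in assoc2 at the 5e-08 cutoff.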
Thanks both, this has been fixed. Note the issue was with only the name of the file and not with the underlying statistics/results :) (You may verify this by looking in your files and ctrl+F'ing the independent SNPs given in the file header. The correct name of the file should be the SNP which has an LD value of 1 for the SNP you're trying to find)
I'm afraid I still do not understand the log file. Why are some SNPs selected (indicated by the line "Selected entry SNP XYZ with cpval X") but never show up later, when the colocalisation analyses are performed? That is, if I search the file for SNP XYZ, it occurs only once. This is not the case for all selected SNPs; the majority of them are actually analysed later in the colocalisation analysis. I am using the latest PWCoCo version, and the SNPs were not eliminated via backward selection. Happy to share the log file if that helps.
Hi,
Please read the above messages: the issue was the cojo file names were incorrect, not the log file. Update PWCoCo, re-run your analysis and the SNPs which were selected and the .cojo file names will now match.
As for your other question that I missed, the nsnps column is the number of SNPs included in the colocalisation analysis. The number of SNPs is determined by the match between your summary statistic files and the reference data you provide, and may differ due to the joint analysis, because the SNPs included in the final colocalisation analyses are those SNPs which have been conditioned (i.e. the statistics in the cojo files).
Hi, thank you very much for your quick answer!
I do understand that the fix was unrelated to the log file; however, I still do not fully understand the meaning of the different lines in the log file. Specifically, why is SNP XYZ ("Selected entry SNP XYZ with cpval X") sometimes not included in the coloc analysis later (i.e. there is no line in the log file that says "Conditioned results for SNP1: XYZ, SNP2: unconditioned")?
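One way to check this systematically, rather than searching the log by hand, is to collect the SNP names from each kind of log line and compare them. The sketch below is a hedged illustration only: the regular expressions are based solely on the log lines quoted in this thread ("Selected SNP ...", "Selected entry SNP ...", "... does not meet threshold", "Conditioned results for SNP1: ..., SNP2: ..."), the function name is hypothetical, and other PWCoCo versions may word these lines differently.

```python
import re

# Patterns inferred from the log lines quoted in this thread (an assumption,
# not a documented log format).
SELECTED = re.compile(r"Selected (?:entry )?SNP (\S+) with")
REJECTED = re.compile(r"(\S+) does not meet threshold")
CONDITIONED = re.compile(r"Conditioned results for SNP1: (.+?), SNP2: (.+)$")


def summarise_log(text):
    """Group SNP names by the kind of log line they appear on."""
    selected, rejected, conditioned = set(), set(), set()
    for line in text.splitlines():
        line = line.strip()
        m = CONDITIONED.search(line)
        if m:
            for name in m.groups():
                name = name.strip().rstrip("*")  # drop the trailing "*" marker
                if name != "unconditioned":
                    conditioned.add(name)
            continue
        m = SELECTED.search(line)
        if m:
            selected.add(m.group(1))
            continue
        m = REJECTED.search(line)
        if m:
            rejected.add(m.group(1))
    return {
        "selected": selected,
        "rejected": rejected,
        "conditioned": conditioned,
        # Selected, not explicitly rejected, yet never conditioned on.
        "selected_not_conditioned": selected - rejected - conditioned,
    }


# Excerpt copied from the minimal example earlier in the thread.
LOG_EXCERPT = """\
[assoc1] Selected SNP 1:27220521:T:G with chisq 100.00 and pval 1.52e-23.
[assoc1] Selected entry SNP 1:27052080:G:A with cpval 9.30e-02.
[assoc1] 1:27052080:G:A does not meet threshold
[assoc1] Finally, 1 associated SNPs have been selected.
Conditioned results for SNP1: 1:27220521:T:G*, SNP2: unconditioned
"""

summary = summarise_log(LOG_EXCERPT)
print(sorted(summary["selected_not_conditioned"]))
```

Note that the sketch pools SNP names across both datasets and only recognises the exact "does not meet threshold" wording, so it is a starting point for inspecting a log rather than a faithful model of the selection procedure.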
I don't think your commit fixed the issue. It seems to name the file after a completely different SNP, one that was never even mentioned in the log. Also, for the minimal example posted above, I now get a segfault, which is way worse than just the wrong file name.
I rolled back to before the change to prevent the segfault. There are more problems with the COJO files. Happy to create a new ticket for this.
When more than one SNP is conditioned, say N >= 2, I get N COJO files, each with N SNP columns. The column contents differ from file to file. Paired with the arbitrary file names, this makes it impossible to extract information from the COJO output.
Hello,
I am running PWCoCo using the following command:
While going through the log file, I got confused by the "stepwise model selection" output of PWCoCo. The log file informs me that certain SNPs are selected as "conditionally independent signals within the region" by cojo, with lines such as "Selected entry SNP 1_247557936_T_C with cpval 2.75e-134.". However, I am unable to find a pwcoco_out.exposure.1_247557936_T_C.cojo file corresponding to this SNP (I have checked, and it was not erased later). Furthermore, later in the coloc part of the pipeline, this SNP is never conditioned on.
On the other hand, I find cojo output files with names like pwcoco_out.exposure.tsv.1_247602308_T_G.cojo, but this specific SNP was never "selected" by cojo (i.e. the log file does not say "Selected entry SNP 1_247602308_T_G"). Nonetheless, SNP 1_247602308_T_G is used later in the coloc analysis ("Conditioned results for SNP1: SNP 1_247602308_T_G, SNP2: 1_247546345_G_A").
I can't really make sense of what is happening there, so I would very much appreciate any insights into the matter!
On a somewhat unrelated note, I was wondering what the nsnps column means in the output format. I did not find an explanation for this in the wiki. I assume it refers to the "Colocalisation analysis initialised with 848 SNPs" part of the log file, but I was wondering how the number of SNPs that are used for the coloc analysis is determined.
Thank you very much in advance!