jwr-git / pwcoco

Pair-wise conditional analysis and colocalisation
GNU General Public License v3.0

pwcoco log file explained #7

Closed nickhir closed 6 months ago

nickhir commented 9 months ago

Hello,

I am running PWCoCo using the following command:

pwcoco --bfile 1000g_autosomes_GRCh37 --sum_stats1 exposure.tsv --sum_stats2 outcome.tsv --maf 0.01 --out_cond --threads 8 --verbose

While going through the log file, I got confused by the "stepwise model selection" output of PWCoCo. The log file informs me that certain SNPs are selected as "conditionally independent signals within the region" by cojo, with lines such as Selected entry SNP 1_247557936_T_C with cpval 2.75e-134. However, I am unable to find a pwcoco_out.exposure.1_247557936_T_C.cojo file corresponding to this SNP (I have checked, and it was not erased later). Furthermore, later in the coloc part of the pipeline, this SNP is never conditioned on.

On the other hand, I find cojo output files with names like pwcoco_out.exposure.tsv.1_247602308_T_G.cojo, but this specific SNP was never "selected" by cojo (i.e. the log file does not say "Selected entry SNP 1_247602308_T_G"). Nonetheless, the SNP 1_247602308_T_G is used later in the coloc analysis ("Conditioned results for SNP1: SNP 1_247602308_T_G, SNP2: 1_247546345_G_A").

I can't really make sense of what is happening here, so I would very much appreciate any insights!

On a somewhat unrelated note, I was wondering what the nsnps column means in the output format. I did not find an explanation for this in the wiki. I assume it refers to the "Colocalisation analysis initialised with 848 SNPs" part of the log file, but I was wondering how the number of SNPs that are used for the coloc analysis is determined.

Thank you very much in advance!

YStrauchP4 commented 6 months ago

I agree the cojo output seems to be buggy.

Here is a minimal example:

assoc1:

1:27052080:G:A A G 0.1 0.001 0.001 0.01
1:27220521:T:G G T 0.01 0.01 0.001 0.001

assoc2:

1:27052080:G:A A G 0.1 0.001 0.01 0.1
1:27220521:T:G G T 0.1 0.01 0.01 0.01

command:

pwcoco \
  --bfile ./plinkout \
  --sum_stats1 ./assoc1 \
  --sum_stats2 ./assoc2 \
  --pve1 1 \
  --pve2 1 \
  --out ./output/out \
  --log ./output/log \
  --out_cond

log (excerpt):

[assoc1] Selected SNP 1:27220521:T:G with chisq 100.00 and pval 1.52e-23.
[assoc1] Selected entry SNP 1:27052080:G:A with cpval 9.30e-02.
[assoc1] 1:27052080:G:A does not meet threshold
[assoc1] Finally, 1 associated SNPs have been selected.
[assoc2] Total amount of SNPs matched from phenotype file with reference SNPs are: 2.
[assoc2] Performing stepwise model selection on 2 SNPs; p cutoff = 5e-08, collinearity = 0.9 assuming complete LE between SNPs more than 10.0 Mb away).
[assoc2] Selected SNP 1:27220521:T:G with chisq 1.00 and pval 3.17e-01.
[assoc2] SNP did not meet threshold.
[assoc2] No SNPs have been selected by the step-wise selection algorithm. Using the unconditioned dataset.
There are 1 selected SNPs in the exposure dataset and 0 in the outcome dataset.
Performing 1 conditional and colocalisation analyses.
Colocalisation analysis initialised with 1 SNPs.
Conditioned results for SNP1: 1:27220521:T:G*, SNP2: unconditioned

coloc result:

Dataset1  Dataset2  SNP1             SNP2           nsnps  H0        H1           H2           H3           H4           log_abf_all
assoc1    assoc2    unconditioned    unconditioned  2      0.99958   0.000199916  0.000199916  1.99916e-08  1.99916e-05  0.000419932
assoc1    assoc2    1:27220521:T:G*  unconditioned  1      0.999898  1.98952e-06  9.99898e-05  0            1.98952e-07  0.000102183

cojo files:

out.assoc1.1:27052080:G:A.cojo

Obviously, I expected the cojo file to be named after the other variant (1:27220521:T:G), which is the one that was selected! Now the most important question: is only the cojo file name wrong, or did PWCoCo condition on the wrong SNP, invalidating the results?

jwr-git commented 6 months ago

Thanks both, this has been fixed. Note the issue was with only the name of the file and not with the underlying statistics/results :) (You may verify this by looking in your files and ctrl+F'ing the independent SNPs given in the file header. The correct name of the file should be the SNP which has an LD value of 1 for the SNP you're trying to find)
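To make that check less manual, here is a rough Python sketch that lists which .cojo files mention a given SNP ID, and on which lines. It assumes nothing about the .cojo layout beyond the ID appearing as a token separated by whitespace or commas; the function name and the ./output directory are only illustrative.

```python
import re
from pathlib import Path


def files_mentioning(snp_id, directory=".", pattern="*.cojo"):
    """Return {file name: [line numbers]} for files that contain snp_id as a token.

    Assumption: fields are separated by whitespace or commas. A trailing '*'
    on a token is ignored, since the log marks some SNPs that way.
    """
    hits = {}
    for path in sorted(Path(directory).glob(pattern)):
        line_numbers = []
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                tokens = {tok.rstrip("*") for tok in re.split(r"[\s,]+", line) if tok}
                if snp_id in tokens:
                    line_numbers.append(number)
        if line_numbers:
            hits[path.name] = line_numbers
    return hits


if __name__ == "__main__":
    # Hypothetical output directory; adjust to wherever --out points.
    for name, lines in files_mentioning("1:27220521:T:G", "./output").items():
        print(name, lines)
```

Matching on whole tokens rather than substrings avoids counting a longer ID that merely starts with the one being searched for.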

nickhir commented 6 months ago

I'm afraid I still do not understand the log file. Why do some SNPs get selected (indicated by the line Selected entry SNP XYZ with cpval X) but never show up later when the colocalisation analyses are performed? That is, if I search the file for SNP XYZ, it only occurs once. This is not the case for all selected SNPs; the majority of them do get analysed later in the colocalisation analysis. I am using the latest pwcoco version, and the SNPs were not eliminated via backward selection. Happy to share the log file if that helps.

jwr-git commented 6 months ago

Hi,

Please read the above messages: the issue was the cojo file names were incorrect, not the log file. Update PWCoCo, re-run your analysis and the SNPs which were selected and the .cojo file names will now match.

As for your other question that I missed, the nsnps column is the number of SNPs included in the colocalisation analysis. The number of SNPs is determined by the match between your summary statistic files and the reference data you provide, and may differ because of the joint analysis: the SNPs included in the final colocalisation analyses are those which have been conditioned (i.e. the statistics in the cojo files).
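As a rough way to sanity-check nsnps, the sketch below counts the SNP IDs shared by both summary statistic files and the PLINK .bim of the reference panel. It is not PWCoCo's own matching logic. It assumes the SNP ID is the first whitespace-separated column of each summary statistic file (as in the minimal example above) and relies on the .bim convention of the variant ID being the second column. The function names are made up for illustration, and the result is at best an upper bound, since SNPs can still drop out afterwards (for instance during the joint analysis).

```python
def read_ids(path, column=0, skip_header=False):
    """Collect one whitespace-separated column of a text file as a set."""
    ids = set()
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            fields = line.split()
            if skip_header and index == 0:
                continue
            if len(fields) <= column:
                continue  # blank or short line
            ids.add(fields[column])
    return ids


def shared_snps(sum_stats1, sum_stats2, bim, skip_header=False):
    """SNP IDs present in both summary statistic files and the reference .bim."""
    ids1 = read_ids(sum_stats1, column=0, skip_header=skip_header)
    ids2 = read_ids(sum_stats2, column=0, skip_header=skip_header)
    reference = read_ids(bim, column=1)  # .bim: chrom, variant ID, cM, bp, A1, A2
    return ids1 & ids2 & reference


if __name__ == "__main__":
    # File names taken from the minimal example earlier in the thread.
    print(len(shared_snps("./assoc1", "./assoc2", "./plinkout.bim")))
```

Pass skip_header=True if your summary statistic files start with a header row.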

nickhir commented 6 months ago

Hi, thank you very much for your quick answer! I do understand that the fix was unrelated to the log file; however, I still do not fully understand the meaning of the different lines in the log file. Specifically, why is SNP XYZ (Selected entry SNP XYZ with cpval X) sometimes not included in the coloc analysis later (i.e. there is no line in the log file that says Conditioned results for SNP1: XYZ, SNP2: unconditioned)?
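To pin down exactly which SNPs this affects, a small Python sketch like the one below can list every SNP that has a "Selected entry SNP" line but no "Conditioned results" line. The regular expressions are based only on the log lines quoted in this thread, so they are an assumption about the format rather than a specification, and the function name is made up. Note that in the minimal example log above, a "Selected entry SNP" line is directly followed by "does not meet threshold", so such a line on its own may not mean the SNP was kept.

```python
import re

# Patterns inferred from the log excerpts in this thread (assumed, not official).
SELECTED = re.compile(r"Selected entry SNP (\S+) with cpval")
CONDITIONED = re.compile(
    r"Conditioned results for SNP1: (?:SNP )?(\S+), SNP2: (?:SNP )?(\S+)"
)


def selected_but_not_conditioned(log_text):
    """Return SNPs with a 'Selected entry SNP' line but no 'Conditioned results' line."""
    selected = []
    conditioned = set()
    for line in log_text.splitlines():
        match = SELECTED.search(line)
        if match:
            if match.group(1) not in selected:
                selected.append(match.group(1))
            continue
        match = CONDITIONED.search(line)
        if match:
            for snp in match.groups():
                conditioned.add(snp.rstrip("*.,"))  # drop the '*' marker and punctuation
    return [snp for snp in selected if snp not in conditioned]


if __name__ == "__main__":
    excerpt = """\
[assoc1] Selected SNP 1:27220521:T:G with chisq 100.00 and pval 1.52e-23.
[assoc1] Selected entry SNP 1:27052080:G:A with cpval 9.30e-02.
[assoc1] 1:27052080:G:A does not meet threshold
Conditioned results for SNP1: 1:27220521:T:G*, SNP2: unconditioned
"""
    print(selected_but_not_conditioned(excerpt))  # ['1:27052080:G:A']
```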

YStrauchP4 commented 6 months ago

I don't think your commit fixed the issue. It now seems to name the file after a completely different SNP, one that is never even mentioned in the log. Also, for the minimal example posted above, I now get a segfault, which is way worse than just a wrong file name.

YStrauchP4 commented 6 months ago

I rolled back to before the change to prevent the segfault. There are more problems with the COJO files. Happy to create a new ticket for this.

When more than one SNP was conditioned on, say N >= 2, I get N COJO files, each with N SNP columns. The column contents differ from file to file. Paired with the arbitrary file names, this makes it impossible to extract information from the COJO output.