omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

How to use sumstats from other methods (besides BOLT-LMM, which the wiki refers to) in PolyPred? #192

Closed Y-Isaac closed 4 months ago

Y-Isaac commented 4 months ago

Hi,

I noticed that the PolyPred wiki uses the file generated by the BOLT-LMM parameter --predBetasFile as the estimate of the tagging SNP effect sizes. However, I ran my association analysis with GCTA, and so far I haven't found a similar parameter there. I also reviewed the example files provided in "polypred_example" and noticed that bolt.beta contains far fewer SNPs than bolt.sumstats.txt. What exactly happens during this step? (I couldn't find the answer in the official BOLT-LMM manual.) Does it perform something like "clustering and thresholding" to keep only the lead SNPs? And can I manually process the summary statistics generated by GCTA to achieve the same effect as the BOLT-LMM step in the wiki?

Thanks in advance for your help!

Best regards, Issac

omerwe commented 4 months ago

@Y-Isaac I think you're asking a few different questions:

  1. Does GCTA report per-SNP marginal effect sizes? I'm sure that it does, but I'm not that familiar with it and haven't touched it in a few years... You'll have to ask elsewhere I'm afraid... Once you have the per-SNP marginal effect sizes, you can use them just like the ones reported by BOLT (or any other GWAS tool)

  2. polypred_example is a stripped-down example that is only meant to demonstrate the principles of how to run PolyPred. I think I just kept a random subset of the SNPs to allow for a small file size :)

In general, my recommendation is to use all SNPs for prediction without any filtering step.

Hope this answers your questions, please let me know if not!

Y-Isaac commented 4 months ago

Hi,

Thanks for your answer! Sorry, my previous question might have been a bit ambiguous, so let me clarify:

  1. Yes, GCTA does report per-SNP marginal effect sizes.

  2. My main concern is the wiki section "Estimating tagging SNP effect sizes using another method", where you used both the --statsFile and --predBetasFile parameters of BOLT-LMM. These parameters produce two output files, and I noticed that the --predBetasFile output contains far fewer variants than the --statsFile output (in my case, just over 370,000 versus more than 13 million). I therefore suspect that the variants written by --predBetasFile have gone through some stricter filtering. I'm sorry that I couldn’t find a more detailed explanation in the official BOLT-LMM documentation.

  3. I would like to confirm which BOLT-LMM output the file "bolt.betas.gz" corresponds to in the wiki section "Linearly combining the effect sizes of PolyFun and the other method": is it the output of --predBetasFile or of --statsFile? The wiki says that what is needed is the "Effect for tag SNPs", but your reply said to "use all SNPs for prediction without any filtering step", which has left me somewhat confused.

I hope my questions don’t trouble you! I am really looking forward to your guidance once more, and hope you have a wonderful day!

Best regards, Issac

omerwe commented 4 months ago

@Y-Isaac sorry I think I misunderstood your previous question:

  1. You need to use the output file specified in the BOLT-LMM parameter --predBetasFile (for linearly combining effect size estimates of different methods)

  2. Again, I believe my example of BOLT-LMM output files took a random subset of the SNPs for demonstration purposes only. I just selected a subset of SNPs to make the examples run faster and take less storage space. In practice, the BOLT-LMM output file should include millions of SNPs.

  3. My previous response had a mistake. I wrote that GCTA needs to give you per-SNP marginal effect sizes. However, what you need are per-SNP joint effect sizes, not per-SNP marginal effect sizes (this is similar to the difference between the two BOLT-LMM output files). I assume GCTA reports this quantity, but I'm not familiar with it.
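
To make the marginal vs. joint distinction in point 3 concrete, here is a minimal toy sketch in Python. It is not PolyFun, BOLT-LMM, or GCTA code: the two-SNP setup, the LD value `r`, and the helper names `marginal_from_joint` and `joint_from_marginal` are all assumptions made up for illustration. It assumes standardized genotypes, so that the expected marginal effects equal the LD matrix times the joint effects.

```python
# Toy illustration only (hypothetical helpers, not part of any tool's API).
# Two standardized SNPs with LD correlation r; LD matrix is [[1, r], [r, 1]].

def marginal_from_joint(joint, r):
    """Expected marginal (one-SNP-at-a-time) effects given joint effects."""
    b1, b2 = joint
    return (b1 + r * b2, r * b1 + b2)

def joint_from_marginal(marginal, r):
    """Invert the 2x2 LD matrix to recover joint effects from marginal ones."""
    m1, m2 = marginal
    det = 1.0 - r * r
    if det == 0:
        raise ValueError("SNPs in perfect LD: joint effects are not identifiable")
    return ((m1 - r * m2) / det, (m2 - r * m1) / det)

r = 0.8             # assumed LD correlation between the two SNPs
joint = (0.5, 0.0)  # SNP 1 is causal; SNP 2 has no effect of its own

marginal = marginal_from_joint(joint, r)
recovered = joint_from_marginal(marginal, r)

print("marginal effects:", marginal)         # SNP 2 looks associated via LD
print("recovered joint effects:", recovered)
```

In this toy setup SNP 2 gets a nonzero marginal effect purely because it is correlated with SNP 1. Summing marginal effects across SNPs in a predictor would therefore count the same signal more than once, which is why the joint estimates are the ones to combine for prediction.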

Y-Isaac commented 4 months ago

Thanks for your explanation!