omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Some question about "best-practice recommendations" in article #179

Closed Y-Isaac closed 6 months ago

Y-Isaac commented 6 months ago

First of all, I want to thank your team for the effort put into developing such a great fine-mapping framework.

While reading the article "Functionally informed fine-mapping and polygenic localization of complex trait heritability," I was confused by the second of the best-practice recommendations. The article states: "PolyFun + SuSiE can alternatively use a nonoverlapping LD reference panel from the target population spanning ≥10% of the target sample size, with L = 10." Does this mean that the LD reference panel should consist of individuals from the same ethnicity as the target population who do not overlap with the GWAS sample? For instance, if I am conducting a meta-analysis of UK Biobank (UKB) and another cohort, but I only have access to the genetic data of UKB individuals, should I use the UKB genetic data alone as the reference panel, or should I look for nonoverlapping European individual-level data to use as the reference panel?

Furthermore, if I were to conduct a cross-ethnic meta-analysis, how should I go about finding genetic information from a population consistent with the target population?

Thanks for your reply in advance!

Y-Isaac commented 6 months ago

Additionally, I would like to inquire about adopting a more universally applicable locus delineation scheme for general GWAS studies.

In my view, delineating loci centered on the lead SNP might be more precise, which the article also suggests: "However, we envision that the PolyFun software will primarily be used to fine-map genome-wide-significant loci, which harbor most SNPs with PIP > 0.95." My initial thought is therefore to define a 3 Mb window centered on each lead SNP. However, given the large sample size of my GWAS and its numerous significant loci, such large windows would produce many overlapping loci. I am unsure whether I should merge the overlapping loci (which would yield single loci containing many lead SNPs, possibly exceeding the maximum number of causal variants set by the program) or, as mentioned in the article, allow overlaps and assign each overlapping SNP the PIP calculated for the closer locus.

These are just some of my preliminary thoughts, and I hope you can provide additional insights. Thank you very much for your help!

Y-Isaac commented 6 months ago

Would a window size of 1MB be better? Please forgive me, I’m really unsure about this.

Also, in the discussion section of the article, I noticed you mentioned the COJO software. Would it be a good choice to define loci based on conditional regression results and allow only one causal variant per locus?

omerwe commented 6 months ago

@Y-Isaac these are great questions, but I'm not sure I have easy answers. To answer your first question: There's no easy way to do fine-mapping of trans-ethnic studies unless you have in-sample LD. It's just a question of data: You can't escape the fact that your LD reference panel doesn't exactly represent the actual population used to generate the summary statistics. I don't think any method could ever solve this... Sorry I don't have an easy solution, but I don't think there is one.

For your second question, I think having multiple windows, each around a lead SNP, is the right way to go. If a SNP falls within multiple overlapping windows, I would just use the PIP from the window where that SNP is closest to the center. I don't think you can afford to define a huge window size, either statistically or computationally, so you have to use some kind of approximation anyway, and we found this to be a useful approximation.
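(Editorial illustration, not part of the original reply and not taken from the PolyFun codebase.) The closest-to-center rule described above can be sketched as follows. The function names `make_windows` and `resolve_pips`, the use of base-pair positions as SNP keys, and the tie-breaking choice (the earlier window wins) are all assumptions made for this example; it also assumes a single chromosome and that per-window PIPs have already been computed.

```python
from typing import Dict, List, Tuple

Window = Tuple[int, int, int]  # (start, end, center), in base pairs


def make_windows(lead_positions: List[int], window_bp: int) -> List[Window]:
    """Build one window of total width window_bp centered on each lead SNP."""
    half = window_bp // 2
    return [(max(1, pos - half), pos + half, pos) for pos in lead_positions]


def resolve_pips(
    windows: List[Window], pips_per_window: List[Dict[int, float]]
) -> Dict[int, float]:
    """For each SNP, keep the PIP from the window whose center is closest.

    pips_per_window[i] maps SNP position -> PIP as fine-mapped in windows[i].
    On an exact tie in distance, the earlier window in the list is kept.
    """
    best: Dict[int, Tuple[int, float]] = {}  # position -> (distance, pip)
    for (start, end, center), pips in zip(windows, pips_per_window):
        for snp_pos, pip in pips.items():
            if not (start <= snp_pos <= end):
                continue  # ignore SNPs reported outside this window
            dist = abs(snp_pos - center)
            if snp_pos not in best or dist < best[snp_pos][0]:
                best[snp_pos] = (dist, pip)
    return {pos: pip for pos, (_, pip) in best.items()}


# Hypothetical example: two lead SNPs 1 Mb apart, 3 Mb windows (they overlap).
windows = make_windows([1_000_000, 2_000_000], 3_000_000)
pips = [
    {1_400_000: 0.9, 1_800_000: 0.1},                    # window 1 results
    {1_400_000: 0.2, 1_800_000: 0.7, 3_000_000: 0.05},   # window 2 results
]
final = resolve_pips(windows, pips)
```

Here the SNP at 1,400,000 takes its PIP from the first window (400 kb from that center versus 600 kb from the second), while the SNP at 1,800,000 takes its PIP from the second window.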

For the third question: I wouldn't use COJO together with PolyFun; it's not compatible with the assumptions of PolyFun and could lead to strange results.

Hope this helps, please let me know if not (if possible, please ask different questions in different GitHub issues to avoid mixing several threads together).

Y-Isaac commented 6 months ago

Thank you so much for your assistance! I feel my understanding of PolyFun has become more accurate. I apologize for any confusion caused, and I will try to avoid mixing multiple questions in the future.

Regarding my first question, I still have some uncertainties. When I cannot obtain the genetic data of the individuals included in a GWAS, is it a good alternative to use non-overlapping individuals of the same ethnicity as an LD reference panel (with the number of non-overlapping samples being at least 10% of the GWAS sample size)? This is in reference to the statement "PolyFun + SuSiE can alternatively use a nonoverlapping LD reference panel from the target population spanning ≥10% of the target sample size." I want to make sure my understanding of this statement is correct (in non-trans-ethnic studies).

Thank you again for your reply, and I wish you a pleasant day!

Y-Isaac commented 6 months ago

Please forgive me for not being very clear in my explanation. My biggest concern is that, when using data from biobanks like UKB in a GWAS meta-analysis, it is very challenging to obtain individual-level genotype data for every cohort. If PolyFun requires an LD reference panel of non-overlapping individuals amounting to at least 10% of the sample size, then for a total sample size of 500,000, researchers must find genotype data for an additional 50,000 non-overlapping individuals, which also seems hard to achieve. That is the source of my confusion.

I hope my question doesn't bother you, and thanks again!

omerwe commented 6 months ago

@Y-Isaac. For your first question: Yes, using a sufficiently large non-overlapping (i.e., external) LD reference panel from the target population is a reasonable substitute for in-sample LD, but it has to be exactly the target population. This is not easy to obtain unfortunately...

Re your second question: You're right, and accurate fine-mapping is indeed very challenging in practice. I wish I had something more constructive to say, but our experiments indicate that fine-mapping can often give wrong results with even slight inaccuracies in the LD reference panel. It is what it is... I wish that more studies would publicly release their LD reference panels, but unfortunately not many studies do this...

Y-Isaac commented 6 months ago

Yes, I agree with you. Thanks for your help!