omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Guidance on generating per-SNP heritabilities with custom annotations #198

Closed Al-Murphy closed 4 months ago

Al-Murphy commented 5 months ago

I'm hoping to generate new per-SNP heritabilities (prior distribution of the SNP effect sizes) based on some custom annotations to then be used to fine-map SNPs for a complex trait relating to a specific cell type.

I know I should use polyfun.py to create these, along with the annotations.

My question is whether I should use all 187 annotations from the baseline-LF 2.2.UKB model used in the paper (a broad set of coding, conserved, regulatory and LD-related annotations) together with my 13 custom annotations for the cell type of interest. Or is it more advisable to use only the custom annotations?

I also have two follow-up questions:

1) Where can I get the 187 annotations used for the publication? I see that example_data/annotations.CHR.annot.parquet has a subset of them:

                    SNP  CHR         BP  A1 A2  Coding_UCSC_common  \
0           rs201321709    1     751580   C  T                   0   
1           rs555115897    1     769374   G  A                   0   
2           rs138499329    1     772437   C  T                   0   
3           rs183307028    1     777456   C  T                   0   
4           rs149978434    1     779286   C  A                   0    

       Coding_UCSC_lowfreq  Conserved_LindbladToh_common  \
0                        0                             0   
1                        0                             0   
2                        0                             0   
3                        0                             0   
4                        0                             0   

       Conserved_LindbladToh_lowfreq  Repressed_Hoffman_common  \
0                                  0                         0   
1                                  0                         0   
2                                  0                         0   
3                                  0                         0   
4                                  0                         0   

       Repressed_Hoffman_lowfreq  base  
0                              0     1  
1                              0     1  
2                              0     1  
3                              0     1  
4                              0     1  

2) Relatedly, where can I download the LD-score weights for the UK Biobank cohort analysed in the paper? Again, a subset seems to be in ./example_data/weights.CHR.l2.ldscore.parquet. Apologies if these are trivial questions.

Thanks!

Al-Murphy commented 5 months ago

Apologies, I believe I found the answers to my two follow-up questions: the functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations, and the UK Biobank LD matrices. I would still appreciate advice on the first question. Thanks!

omerwe commented 5 months ago

@Al-Murphy For the purpose of improving fine-mapping accuracy, it's advisable to use all 187 functional annotations in addition to your 13 custom annotations, as this will lead to more informative prior causal effects that take more sources of information into account.
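
For readers who want to try this, here is a minimal pandas sketch of appending custom annotation columns to a baseline annotation table. It is an illustration, not PolyFun's own procedure: the key columns are taken from the example file printed earlier in the thread, `celltype_zscore` is a hypothetical annotation name, and filling SNPs that lack a custom value with 0 is an assumption. The thread does not say whether PolyFun needs a single merged file per chromosome, so check the project documentation before relying on this layout.

```python
import pandas as pd

# Key columns as they appear in the example annotation file quoted in this thread.
KEY_COLS = ["SNP", "CHR", "BP", "A1", "A2"]


def combine_annotations(baseline: pd.DataFrame, custom: pd.DataFrame) -> pd.DataFrame:
    """Append the custom annotation columns to the baseline annotation table."""
    overlap = (set(baseline.columns) & set(custom.columns)) - set(KEY_COLS)
    if overlap:
        raise ValueError(f"annotation names clash: {sorted(overlap)}")
    # A left join keeps every baseline SNP, in the baseline order.
    merged = baseline.merge(custom, on=KEY_COLS, how="left", validate="one_to_one")
    custom_cols = [c for c in custom.columns if c not in KEY_COLS]
    # Assumption: a SNP with no custom value is treated as unannotated (0).
    merged[custom_cols] = merged[custom_cols].fillna(0.0)
    return merged


# Toy data standing in for one chromosome of annotations.
baseline = pd.DataFrame({
    "SNP": ["rs201321709", "rs555115897", "rs138499329"],
    "CHR": [1, 1, 1],
    "BP": [751580, 769374, 772437],
    "A1": ["C", "G", "C"],
    "A2": ["T", "A", "T"],
    "Coding_UCSC_common": [0, 0, 0],
    "base": [1, 1, 1],
})
custom = pd.DataFrame({
    "SNP": ["rs201321709", "rs138499329"],
    "CHR": [1, 1],
    "BP": [751580, 772437],
    "A1": ["C", "C"],
    "A2": ["T", "T"],
    "celltype_zscore": [1.7, -0.4],  # hypothetical custom annotation
})

combined = combine_annotations(baseline, custom)
print(combined)
```

The `validate="one_to_one"` argument makes the merge fail loudly if either table has duplicate SNP keys, which is easier to debug than silently duplicated rows.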

Al-Murphy commented 5 months ago

That makes a lot of sense, thank you!

Al-Murphy commented 5 months ago

@omerwe Apologies, but one further question: some of my custom annotations are Z-scores and so have both positive and negative values. I see that the continuous annotations among the 187 range from 0 to 1. Would it be fine to apply min-max scaling to my annotations before adding them? Or would you split each one into two annotations, one for positive and one for negative values, since 0 (the absence of an annotation) would map to around 0.5 under min-max scaling?
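
The concern about zero can be checked on a toy vector. The sketch below is plain Python with a hypothetical helper `min_max_scale`. It shows that min-max scaling a symmetric set of Z-scores moves 0 to 0.5, so an unannotated SNP would no longer have the value 0.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot min-max scale a constant vector")
    return [(v - lo) / (hi - lo) for v in values]


# Toy Z-scores, symmetric around zero.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
scaled = min_max_scale(z)
print(scaled)  # the original 0.0 now sits at 0.5
```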

omerwe commented 5 months ago

@Al-Murphy Z-scores are perfectly fine. You might want to normalize them to have variance 1.0, which could improve numerical stability and the behavior of the Ridge regression model.
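
One way to do the rescaling suggested above is to divide each Z-score annotation by its standard deviation. This is a sketch under assumptions: the helper name `scale_to_unit_variance` is hypothetical, and dividing without first subtracting the mean is a choice made here so that 0 stays at 0. The thread only asks for variance 1.0 and does not say whether to centre.

```python
import numpy as np


def scale_to_unit_variance(z):
    """Divide by the population standard deviation so the result has variance 1."""
    z = np.asarray(z, dtype=float)
    sd = z.std()  # ddof=0, i.e. population standard deviation
    if sd == 0:
        raise ValueError("cannot rescale a constant annotation")
    # No centring: a value of 0 (no annotation) remains exactly 0.
    return z / sd


# Toy Z-scores for a single custom annotation.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 0.0])
scaled = scale_to_unit_variance(z)
print(scaled.var())
```

Unlike min-max scaling, this keeps the sign of each Z-score, so there is no need to split positive and negative values into separate annotations.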

Al-Murphy commented 4 months ago

Thanks @omerwe!