omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License
Enrichment .results file #16

Closed bschilder closed 4 years ago

bschilder commented 4 years ago

I noticed in the original S-LDSC that a .results file with the enrichment scores for each annotation is also produced. Is PolyFun also capable of producing this file (or something comparable)?

Thanks, Brian

From the LDSC wiki (https://github.com/bulik/ldsc/wiki/Partitioned-Heritability#results-file):

.results file

This file has the results of the analysis in tab-delimited form. If any category contains all SNPs, then that category will not appear in this file. There is one row for each category and columns summarizing the results: Proportion of SNPs, Proportion of heritability, Enrichment, and standard errors. Enrichment is (Prop. heritability) / (Prop. SNPs). If you use the --print-coefficients flag, then there will also be columns for the regression coefficients. (See Finucane, Bulik-Sullivan et al., bioRxiv for a discussion of the relationship between the coefficients and proportions of heritability.)
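As a small illustration of the enrichment definition quoted above, here is a minimal Python sketch. The category names and proportions are made-up numbers for illustration only; they are not taken from any real .results file.

```python
def enrichment(prop_h2, prop_snps):
    """Enrichment as defined in the LDSC wiki: (Prop. heritability) / (Prop. SNPs)."""
    if prop_snps <= 0:
        raise ValueError("prop_snps must be positive")
    return prop_h2 / prop_snps


# Hypothetical categories: name -> (proportion of SNPs, proportion of heritability).
categories = {
    "coding": (0.02, 0.10),
    "intergenic": (0.50, 0.25),
}

results = {
    name: enrichment(prop_h2=prop_h2, prop_snps=prop_snps)
    for name, (prop_snps, prop_h2) in categories.items()
}

for name, value in results.items():
    print(f"{name}: enrichment = {value:.2f}")
```

A value above 1 means the category explains more heritability than its share of SNPs would suggest; a value below 1 means it explains less.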

omerwe commented 4 years ago

Hi,

I guess it depends on what exactly you're interested in. If you'd like enrichment estimates for the various annotations, then PolyFun isn't the best tool for this because it uses a regularized version of S-LDSC, leading to biased estimates and uncalibrated inference. In this case you're better off running regular S-LDSC.

You can run regular S-LDSC from within the polyfun code. This is convenient because vanilla S-LDSC doesn't support Python 3 or parquet files. To do this, simply invoke ldsc.py from the main polyfun directory (after git pull), just like you would invoke regular S-LDSC. Please note that this is a heavily modified version of ldsc, but the output should be exactly the same.

BTW I didn't provide M_5_50 files for the baselineLF annotations, so you will have to run ldsc.py with the flag --not-M-5-50. This means that the enrichment estimates will be provided with respect to MAF>0.1% SNPs instead of MAF>5% SNPs.

Hope it's clear, please let me know if not!

Omer
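The invocation described above could be assembled roughly as follows. This is a sketch only: it assumes ldsc.py accepts the usual S-LDSC partitioned-heritability flags (`--h2`, `--ref-ld-chr`, `--w-ld-chr`, `--overlap-annot`, `--out`) in addition to the `--not-M-5-50` and `--print-coefficients` flags mentioned in this thread, and every file path is a placeholder. The helper `build_sldsc_command` is hypothetical and only builds the argument list; it does not run anything.

```python
import shlex
import sys


def build_sldsc_command(sumstats, ref_ld_prefix, weights_prefix, out_prefix,
                        print_coefficients=False):
    """Build an argument list for running ldsc.py from the main polyfun directory."""
    cmd = [
        sys.executable, "ldsc.py",
        "--h2", sumstats,
        "--ref-ld-chr", ref_ld_prefix,
        "--w-ld-chr", weights_prefix,
        "--overlap-annot",
        "--not-M-5-50",  # needed because no M_5_50 files ship with baselineLF
        "--out", out_prefix,
    ]
    if print_coefficients:
        cmd.append("--print-coefficients")
    return cmd


# All paths below are placeholders, not files known to exist in the repository.
cmd = build_sldsc_command(
    sumstats="my_sumstats.parquet",
    ref_ld_prefix="my_annotations/baselineLF.",
    weights_prefix="my_weights/weights.",
    out_prefix="output/sldsc",
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to inspect before passing the list to `subprocess.run(cmd, check=True)` from the main polyfun directory.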

bschilder commented 4 years ago

I see, that makes sense. I'll give regular S-LDSC a try and try to get them that way.

Along those lines though, are the L2-regularization weights for each annotation recorded somewhere in the PolyFun pipeline? While not really enrichment, they might still be informative as to which annotations are being prioritized.

omerwe commented 4 years ago

In my experience the annotation coefficients are not very informative. The reason is that many annotations are strongly correlated, so the coefficient for each annotation by itself is pretty meaningless. For example, you could get a negative coefficient for coding SNPs, but coding SNPs would still be strongly prioritized. The more relevant quantity is enrichment, as reported by S-LDSC. I could add a flag to report this, but it will probably be pretty similar to the S-LDSC reported enrichment, so I suggest you use that one for now...
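To make the point about negative coefficients concrete, here is a toy Python sketch. It assumes an additive model in which each SNP's variance is the sum of the coefficients of the annotations it carries, and all numbers (SNP counts, coefficients, annotation overlap) are invented for illustration; nothing here is PolyFun or S-LDSC output.

```python
# Toy setup: 1000 SNPs, 100 of them conserved, and 20 of those also coding.
n_snps = 1000
conserved = [1 if j < 100 else 0 for j in range(n_snps)]
coding = [1 if j < 20 else 0 for j in range(n_snps)]  # subset of conserved

# Hypothetical coefficients: coding gets a NEGATIVE coefficient.
tau_base, tau_conserved, tau_coding = 1.0, 9.0, -2.0

# Per-SNP variance under the assumed additive model.
per_snp_var = [
    tau_base + tau_conserved * conserved[j] + tau_coding * coding[j]
    for j in range(n_snps)
]


def annotation_enrichment(annotation):
    """(Proportion of heritability) / (proportion of SNPs) for a binary annotation."""
    h2_in = sum(v for v, a in zip(per_snp_var, annotation) if a)
    prop_h2 = h2_in / sum(per_snp_var)
    prop_snps = sum(annotation) / len(annotation)
    return prop_h2 / prop_snps


coding_enrichment = annotation_enrichment(coding)
print(f"coding coefficient: {tau_coding}, coding enrichment: {coding_enrichment:.2f}")
```

Because every coding SNP in this toy setup also carries the strongly positive conserved coefficient, coding SNPs end up enriched even though their own coefficient is negative.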

bschilder commented 4 years ago

I see, that makes sense. Collinearity would undermine the usefulness of the weights for this purpose. I'll give the original method a go. Thanks!

bschilder commented 4 years ago

So I tried running ldsc.py directly, but I'm getting some errors:

schilder@brians-mbp-2 ~/D/F/e/t/polyfun> python ldsc_polyfun/ldsc.py -h  (polyfun_venv)
Traceback (most recent call last):
  File "ldsc_polyfun/ldsc.py", line 11, in <module>
    import ldsc_polyfun.ldscore as ld
ModuleNotFoundError: No module named 'ldsc_polyfun'

I think this is happening because all of the scripts are set up to run everything from the scripts in the next directory up. Though I'm not sure what the best solution would be while still preserving the normal use case.
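The general mechanism behind this kind of error can be reproduced outside PolyFun. The sketch below uses a throwaway package with the hypothetical name `mypkg`: when a script inside a package directory is run directly, Python puts that script's own directory on `sys.path`, not its parent, so an absolute import of the package fails. It illustrates Python's import behavior only and is not a diagnosis of the PolyFun layout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo():
    """Run the same script from inside a package and from the project root."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = root / "mypkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helper.py").write_text("VALUE = 42\n")

        script = "import mypkg.helper as h\nprint(h.VALUE)\n"
        (pkg / "main.py").write_text(script)   # script inside the package
        (root / "main.py").write_text(script)  # same script at the top level

        # sys.path[0] becomes <root>/mypkg, so 'mypkg' cannot be found.
        inside = subprocess.run([sys.executable, "mypkg/main.py"], cwd=root,
                                capture_output=True, text=True)
        # sys.path[0] becomes <root>, so 'mypkg' is importable.
        top = subprocess.run([sys.executable, "main.py"], cwd=root,
                             capture_output=True, text=True)
        return inside, top


inside, top = run_demo()
print("inside package:", inside.returncode, inside.stderr.strip().splitlines()[-1:])
print("top level:", top.returncode, top.stdout.strip())
```

Running the entry-point script from the top-level directory keeps the package importable without changing any import statements.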

omerwe commented 4 years ago

Are you sure your code is updated from github? I moved ldsc.py to the main polyfun directory a few days ago.

bschilder commented 4 years ago

That was totally it, I thought I pulled it already but I just tried again and now I see the file moved. Works perfectly now. Thanks!

omerwe commented 4 years ago

Great!

aselewa commented 4 years ago

Hello, I am also looking for enrichments from L2-regularized LDSC. This is convenient when we have many correlated annotations.

Would it be reasonable to add up the per-SNP heritabilities reported in snpvar_ridge.gz for each annotation and divide by the # of SNPs to get the enrichment?

omerwe commented 4 years ago

PolyFun uses L2-regularization, which reduces estimation variance but adds bias. This makes statistical inference of enrichment complicated. I recommend that you run S-LDSC for this purpose. I just updated the Wiki with instructions on running S-LDSC through the PolyFun code base.

aselewa commented 4 years ago

Thanks for writing this up. Do you have any thoughts on extending LDSC to do variable selection with L1? I know many who are wrestling with this problem of which annotations to use. So far I've been just eye-balling the highly enriched ones from S-LDSC.

omerwe commented 4 years ago

In PolyFun there's no need for variable selection because it uses L2 regularization to down-weight less important annotations. Some people have been investigating this but I haven't seen an example where variable selection improves fine-mapping power.
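For readers unfamiliar with how L2 regularization behaves with correlated predictors, here is a generic two-feature ridge regression sketch in plain Python. It is not PolyFun's regularized S-LDSC code; the data, the penalty values, and the helper `ridge_two_features` are all made up for illustration. The outcome is constructed so that the unpenalized fit gives coefficients of opposite sign for two nearly identical features.

```python
import math


def ridge_two_features(x1, x2, y, lam):
    """Closed-form ridge solution (no intercept) for two features: (X'X + lam*I)^-1 X'y."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * t for a, t in zip(x1, y))
    b2 = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


# Two nearly collinear made-up features.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.0, 4.1]
# Outcome built as 3*x1 - 2*x2, so the unpenalized coefficients have opposite signs.
y = [3 * a - 2 * b for a, b in zip(x1, x2)]

fits = {lam: ridge_two_features(x1, x2, y, lam) for lam in (0.0, 0.01, 1.0)}
norms = {lam: math.hypot(*beta) for lam, beta in fits.items()}

for lam, beta in fits.items():
    print(f"lambda={lam}: beta=({beta[0]:.3f}, {beta[1]:.3f}), norm={norms[lam]:.3f}")
```

In this toy data the opposite-sign pattern persists under a weak penalty and disappears under a stronger one, while the overall coefficient norm shrinks as the penalty grows. That sensitivity is one reason individual coefficients of correlated features are hard to interpret.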

aselewa commented 4 years ago

That makes sense. Is it possible to output the weights of the annotations from L2? I'm not interested in the enrichment, just in which annotations are prioritized for fine-mapping (my annotations don't overlap much). Thanks!

omerwe commented 4 years ago

It's technically possible but it's not meaningful. When you have strongly correlated annotations they 'compete' with each other: one annotation could have a strongly negative coefficient and the other a strongly positive one. It might be that these annotations are not important because they cancel each other out. Enrichment is probably what you actually want: enriched annotations drive the fine-mapping prior.

Other possibilities are to include modern feature-selection techniques (e.g. SHAP scores) but for now I recommend that people use enrichment estimates.

aselewa commented 4 years ago

I see. Are you suggesting that I run S-LDSC first with all annotations jointly, choose a subset that is highly enriched, and then use this subset of annotations to generate priors with PolyFun?

omerwe commented 4 years ago

I'm actually suggesting that you run PolyFun with all of the annotations. The L2 regularizer will use all of them to maximize the accuracy of the fine-mapping prior. It may be counterintuitive, but this is what works best in practice.

S-LDSC is convenient to understand which annotations are most important but it's not strictly needed.

aselewa commented 4 years ago

That makes a lot of sense! Thanks for the discussion. (Apologies to OP for hijacking this thread.)

bschilder commented 4 years ago

Haha, no worries, @aselewa. I actually found the discussion helpful.