omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Updates to HESS implementation #71

Closed · jdblischak closed 2 years ago

jdblischak commented 2 years ago

This PR includes multiple updates to the HESS implementation, as discussed in Issue #70.

So far I've done the following tests:

I'm working on a more large-scale test to confirm that the HESS estimates are similar when using estimate_h2_hess_wrapper() compared to the current default. I wanted to submit the PR now for any feedback. Thanks!

omerwe commented 2 years ago

Thanks @jdblischak, looks great to me so far! I don't have any constructive feedback :)

jdblischak commented 2 years ago

Alright, I found a discrepancy between estimate_h2_hess() and estimate_h2_hess_wrapper().

estimate_h2_hess() does not define pvalue_bound by default.

https://github.com/omerwe/polyfun/blob/ee47aaccfbe6e7001e20ac2f9baf6b12d45170ea/finemapper.py#L611

In contrast, by default estimate_h2_hess_wrapper() sets min_h2=1e-4

https://github.com/omerwe/polyfun/blob/ee47aaccfbe6e7001e20ac2f9baf6b12d45170ea/finemapper.py#L645

which is used to define a pvalue_bound that is passed to estimate_h2_hess()

https://github.com/omerwe/polyfun/blob/ee47aaccfbe6e7001e20ac2f9baf6b12d45170ea/finemapper.py#L653-L656

If none of the SNPs meet this additional p-value threshold, then h2_hess is 0:

https://github.com/omerwe/polyfun/blob/ee47aaccfbe6e7001e20ac2f9baf6b12d45170ea/finemapper.py#L622-L628

which ultimately results in an error:

https://github.com/omerwe/polyfun/blob/ee47aaccfbe6e7001e20ac2f9baf6b12d45170ea/finemapper.py#L746-L747

I tested my PR by performing genome-wide fine-mapping of the UKBB trait blood_EOSINOPHIL_COUNT with --hess. The majority of the loci failed because the HESS estimate was 0 (i.e. no SNPs in those loci passed the p-value threshold).

How would you like me to proceed? Setting min_h2=None would restore the previous behavior of estimate_h2_hess(). Currently there is no command-line control of either min_h2 or pvalue_bound.
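The chain described above can be sketched as follows. This is a minimal illustration with made-up helper names, not the actual polyfun code; it relies only on the identity chi2(1).sf(x) == erfc(sqrt(x/2)):

```python
import math

def pvalue_bound_from_min_h2(min_h2, n):
    """p-value threshold implied by a per-SNP h2 bound at sample size n.

    Equivalent to scipy's stats.chi2(1).sf(min_h2 * n).
    """
    return math.erfc(math.sqrt(min_h2 * n / 2))

def locus_passes(locus_pvalues, min_h2, n):
    """True if at least one SNP clears the implied p-value bound.

    When this is False, h2_hess ends up as 0 and fine-mapping errors out.
    """
    bound = pvalue_bound_from_min_h2(min_h2, n)
    return any(p < bound for p in locus_pvalues)

# With the wrapper's default min_h2=1e-4 at n=513859 the implied bound is
# ~7.6e-13, so a locus whose best SNP has p=1e-5 is rejected:
print(locus_passes([1e-5, 1e-3], min_h2=1e-4, n=513859))  # False
```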

omerwe commented 2 years ago

These are very good questions, which show that this process is quite subtle... p-values depend on sample size, so the p-value threshold to use is somewhat arbitrary. I used this technique with UKBB data, which is very large. I found a p-value threshold that worked well in that setting. I indeed didn't fine-map loci in which all SNPs had p>1e-4, assuming that it's extremely unlikely to find a causal SNP in such loci. However, I never investigated how to specify a good per-SNP h2 threshold in general.

Perhaps the best path forward is to add a flag specifying the per-SNP h2 threshold, and to force users to specify a value if they use --hess. This way, users will at least be aware of this subtlety and of the need to make a somewhat arbitrary decision when using --hess. How does that sound?

jdblischak commented 2 years ago

I manually set min_h2=None for estimate_h2_hess_wrapper() and fine-mapped the UKBB trait blood_EOSINOPHIL_COUNT. The plots below compare the HESS estimate from a single iteration (current behavior) versus the mean over 100 iterations (proposed behavior). The results aren't radically different, so most users shouldn't notice much difference.

[two plots comparing the HESS estimate from a single iteration versus the mean over 100 iterations]

jdblischak commented 2 years ago

I used this technique with UKBB data, which is very large. I found a p-value threshold that worked well in that setting. I indeed didn't fine-map loci in which all SNPs had p>1e-4, assuming that it's extremely unlikely to find a causal SNP in such loci.

What did you set for min_h2 when you were mapping the UKBB traits? When I use the current default setting to fine-map blood_EOSINOPHIL_COUNT, I get a p-value threshold that is much more stringent than 1e-4:

# estimate_h2_hess_wrapper()
(Pdb) min_h2
0.0001
(Pdb) self.n
513859
(Pdb) stats.chi2(1).sf(min_h2 * self.n)
7.58815914842483e-13

As an example, the locus chr14:64000001-67000001 has no SNPs with p<7.5e-13 and is thus not fine-mapped:

# estimate_h2_hess()
(Pdb) self.df_sumstats_locus.query('P < 1e-4').shape[0]
281

(Pdb) self.df_sumstats_locus.query('P < 7.58815914842483e-13').shape[0]
0
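Going the other way, a user who knows the p-value threshold they want could back out the corresponding min_h2 by inverting the same relation. This is a hypothetical helper, not part of polyfun; it uses the fact that chi2(1).isf(p) equals the squared normal quantile at 1 - p/2:

```python
from math import erfc, sqrt
from statistics import NormalDist

def min_h2_for_pvalue(p, n):
    """Per-SNP h2 bound whose implied p-value threshold equals p.

    Inverts p = chi2(1).sf(min_h2 * n) using
    chi2(1).isf(p) == NormalDist().inv_cdf(1 - p/2) ** 2.
    """
    return NormalDist().inv_cdf(1 - p / 2) ** 2 / n

# Round trip at n=513859: the h2 bound below maps back to p = 1e-4
h2 = min_h2_for_pvalue(1e-4, 513859)
print(abs(erfc(sqrt(h2 * 513859 / 2)) - 1e-4) < 1e-8)  # True
```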

Perhaps the best path forward is to add a flag specifying the per-SNP h2 threshold, and to force users to specify a value if they use --hess. This way, users will at least be aware of this subtlety and of the need to make a somewhat arbitrary decision when using --hess. How does that sound?

I'd prefer not to go this route, mainly because of backwards compatibility: requiring a threshold would break existing users' code if they use --hess. In contrast, implementing the iterations gives slightly better estimates for negligible extra computing time.

Here's my counter proposal:

  1. Change the default of min_h2 to None. This would keep the new behavior similar to the current behavior.
  2. Add a new optional flag (--hess-h2-thresh?) that allows the user to explicitly set min_h2 to a threshold that makes sense for their data.
  3. If a user sets --hess but not --hess-h2-thresh, log a warning that they should consider setting --hess-h2-thresh.
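A rough argparse sketch of the proposed behavior (the flag name is taken from the proposal above; the wiring is illustrative and not polyfun's actual CLI code):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument('--hess', action='store_true',
                    help='use a HESS-based estimate of the causal h2 in the locus')
parser.add_argument('--hess-h2-thresh', type=float, default=None,
                    help='optional per-SNP h2 threshold for the HESS estimate; '
                         'the default (None) keeps the previous behavior')

args = parser.parse_args(['--hess'])  # example invocation
if args.hess and args.hess_h2_thresh is None:
    logging.warning('--hess was specified without --hess-h2-thresh; '
                    'consider setting a threshold appropriate for your data')
```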
omerwe commented 2 years ago

Hi @jdblischak,

It's been a few years so I don't remember all the details off the top of my head, but according to the manuscript we excluded SNPs with alpha squared <0.00005.

I like your suggestion! I'll be very grateful if you could implement it :)

jdblischak commented 2 years ago

according to the manuscript we excluded SNPs with alpha squared <0.00005

Ok, thanks for clarifying. To make sure I understand: this means that there is no way using the current PolyFun code to perform the same SNP filter for the HESS estimate? In other words, the analysis for the paper filtered SNPs based on their effect size, but the code only allows using a minimal heritability threshold (which gets translated to a p-value threshold).

I like your suggestion! I'll be very grateful if you could implement it :)

Great! I've pushed my proposal. I decided to name the flag --hess-min-h2 to better match the existing function argument min_h2. If you'd prefer --hess-h2-threshold or any other name, please just let me know.

I've performed some minimal tests. I've submitted a more expansive test to run, so I'll report on those results tomorrow.

jdblischak commented 2 years ago

I've finished my more comprehensive tests. I've confirmed that setting --hess-min-h2 affects the HESS estimate when its corresponding pvalue_bound is more stringent (i.e. lower) than the pvalue_cutoff derived from taking the top 0.5% of SNPs in the locus. Otherwise it has no effect.

Please review and let me know if you'd like me to make any changes prior to merging.
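The interaction being tested can be illustrated like this. It is a simplified model, not the exact polyfun code: the top-0.5% quantile step is my paraphrase of the default behavior, while the bound comparison mirrors the logic under discussion:

```python
def effective_cutoff(locus_pvalues, pvalue_bound=None, top_fraction=0.005):
    """Simplified model: the cutoff is the top-0.5% quantile of the locus
    p-values, overridden by pvalue_bound only when the bound is lower."""
    pvals = sorted(locus_pvalues)
    k = max(1, int(len(pvals) * top_fraction))
    pvalue_cutoff = pvals[k - 1]  # p-value of the k-th strongest SNP
    if pvalue_bound is not None and pvalue_cutoff > pvalue_bound:
        pvalue_cutoff = pvalue_bound  # the more stringent bound takes over
    return pvalue_cutoff

pvals = [i / 1000 for i in range(1, 1001)]         # 1000 SNPs, p = 0.001 .. 1.0
print(effective_cutoff(pvals))                     # 0.005 (top 0.5% = 5 SNPs)
print(effective_cutoff(pvals, pvalue_bound=0.1))   # 0.005: permissive bound, no effect
print(effective_cutoff(pvals, pvalue_bound=1e-4))  # 0.0001: stringent bound wins
```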

jerome-f commented 2 years ago

@jdblischak Thanks for investigating this. It might help to log information on how to set --hess-min-h2, e.g. with --hess-min-h2 0.0001 and an n of 500k, the p-value threshold is on the order of 1e-13. If a user sets a p-value threshold in the previous step (create_finemapper.py), how should --hess-min-h2 be changed?

omerwe commented 2 years ago

Thanks @jdblischak, it looks great!

To make sure I understand: this means that there is no way using the current PolyFun code to perform the same SNP filter for the HESS estimate? In other words, the analysis for the paper filtered SNPs based on their effect size, but the code only allows using a minimal heritability threshold (which gets translated to a p-value threshold).

It's approximately the same thing: alpha_i^2 is the estimated heritability tagged by SNP i (assuming the SNP is standardized).
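In code, the equivalence looks like this. This is standard GWAS algebra assumed for illustration rather than taken from polyfun: for a standardized SNP the 1-df chi-squared statistic is Z², so α̂² ≈ Z²/n, and the p-value filter p < chi2(1).sf(min_h2·n) selects exactly the SNPs with α̂² > min_h2:

```python
import math

def pvalue_from_z(z):
    """Two-sided p-value of a Z-score; equals chi2(1).sf(z ** 2)."""
    return math.erfc(abs(z) / math.sqrt(2))

n, min_h2 = 513_859, 1e-4
pvalue_bound = math.erfc(math.sqrt(min_h2 * n / 2))  # chi2(1).sf(min_h2 * n)

for z in (6.0, 7.0, 7.5, 9.0):
    alpha2_hat = z ** 2 / n  # estimated per-SNP h2 of a standardized SNP
    assert (alpha2_hat > min_h2) == (pvalue_from_z(z) < pvalue_bound)
print('the two filters agree')
```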

@Flaviyan-Jerome, as I've said, I never investigated the optimal strategy for specifying hess-min-h2. The value that I chose worked well for UKBB, but I can't guarantee it works well elsewhere... Unfortunately, the best strategy is probably to play with this threshold and get a (subjective) assessment of whether the per-SNP causal h2 seems to make sense.

jdblischak commented 2 years ago

@omerwe Thanks for merging!

It's approximately the same thing: alpha_i^2 is the estimated heritability tagged by SNP i (assuming the SNP is standardized).

Ok, thanks for the explanation.

it might help to log information on how to set --hess-min-h2, e.g. with --hess-min-h2 0.0001 and an n of 500k, the p-value threshold is on the order of 1e-13. If a user sets a p-value threshold in the previous step (create_finemapper.py), how should --hess-min-h2 be changed?

Unfortunately the best strategy is probably to play with this threshold and get a (subjective) assessment of whether the per-SNP causal h2 seems to make sense.

@Flaviyan-Jerome Thanks for the feedback. I agree on both points: 1) guidance would be nice, and 2) it's hard to provide this guidance automatically given how context-dependent it is. Not only does sample size play a role, but the distribution of p-values of the SNPs in any given region determines whether --hess-min-h2 has any effect at all.

In practice I found it difficult to set a single --hess-min-h2 for genome-wide fine-mapping. When fine-mapping the UKBB trait blood_EOSINOPHIL_COUNT, if I set a stringent min_h2 of 7e-5, then this actually had an effect on regions with lots of signal (since it corresponds to a p-value cutoff of 2e-9). The downside though is that then the majority of the regions with modest (or no) signal failed to finemap at all (since the HESS estimate was zero because no SNPs passed the threshold). When I set a more permissive min_h2 of 2.75e-5, all the genome-wide regions could be fine-mapped, but only the regions with more modest signal were impacted (since it corresponds to a p-value cutoff of 1.7e-4).
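The two conversions quoted above check out numerically (using n = 513859 from the earlier pdb session and the identity chi2(1).sf(x) == erfc(sqrt(x/2)); the thresholds came from my runs, so treat them as approximate):

```python
import math

def implied_pvalue_cutoff(min_h2, n=513_859):
    """p-value cutoff implied by a per-SNP h2 bound; chi2(1).sf(min_h2 * n)."""
    return math.erfc(math.sqrt(min_h2 * n / 2))

print(f'{implied_pvalue_cutoff(7e-5):.1e}')     # ~2e-9: the stringent setting
print(f'{implied_pvalue_cutoff(2.75e-5):.1e}')  # ~1.7e-4: the permissive setting
```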

To gain feedback on whether --hess-min-h2 had any effect on a locus, I temporarily added a logging statement to estimate_h2_hess(). The reason I didn't include it in the PR is that this function is called 100 times, and thus it floods the log file.

        if pvalue_bound is not None and pvalue_cutoff>pvalue_bound:
            logging.info('HESS: Actually used p-value cutoff of %0.4e instead of %0.4e', pvalue_bound, pvalue_cutoff)
            pvalue_cutoff = pvalue_bound

My guidance to a user would be to start with only using --hess. The default of only including the top 0.5% of SNPs in a region already does a good job of only including the most significant SNPs. If the HESS estimates aren't reasonable, then they can experiment with --hess-min-h2. If the HESS estimates seem off for regions with less signal, then set a permissive min_h2. If on the other hand the HESS estimates seem off for regions with lots of signal, set a more stringent min_h2.

omerwe commented 2 years ago

@jdblischak I agree with your points. However, I would argue that if no SNP has p<2e-9 in some region in UKBB, then it's extremely unlikely to find a causal SNP in that region that can be accurately fine-mapped (even if we knew the true causal h2). Hence the behavior you described might not be so problematic. Of course, this rationale applies to UKBB and may not apply to other studies.

jdblischak commented 2 years ago

Hence the behavior you described might not be so problematic

@omerwe Good point, and I agree. From a scientific perspective, there is unlikely to be an issue, since there probably isn't any signal in the region anyway. I was thinking from more of an infrastructure perspective: when I create a bioinformatics pipeline to perform genome-wide fine-mapping, I'd prefer to handle as few special cases as possible. I'd rather it simply fine-map as many regions as possible, and then analyze the results to find the interesting signals.

omerwe commented 2 years ago

@jdblischak I agree!