omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)

Polypred gives 0 weights for finemap effects #80

Closed zhilizheng closed 2 years ago

zhilizheng commented 2 years ago

Hi Omer,

I tried PolyPred and it gives us consistent weights of 0 for the fine-mapping effects. Here is the code I tried:

# aggregate all of the results

$python aggregate_finemapper_results.py \
    --out-prefix output/polyfun_output \
    --sumstats output/weights.snpvar_ridge_constrained.gz \
    --out output/polyfun_output2.agg.txt.gz \
    --adjust-beta-freq

$python polypred.py \
    --combine-betas \
    --betas polypred_example/test.betas.gz,output/polyfun_output2.agg.txt.gz \
    --pheno polypred_example/test.pheno \
    --output-prefix output/polypred \
    --plink-exe ~/bin/plink \
    polypred_example/test.*.bed

This gives us weights like the ones below; the prediction accuracy is dominated by the tagging effects.

mix weights:

    mix_weight
    .....test.beta.gz                 0.4662990827518956
    .....polyfun_output2.agg.txt.gz   0.0
    intercept                         0.12705637455594848

Could you help me with this problem?

Thanks. Regards, Zhili

omerwe commented 2 years ago

Hi @zhilizheng, I don't think this is a real problem.

PolyPred uses non-negative least-squares (NNLS) estimation, so it sometimes returns 0 if it "decides" that one of the predictors is not informative. I guess this is what's going on in this (artificial) example. This shouldn't happen in real data --- this is just not a good example of fine-mapping in this context...
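As a minimal sketch of this behavior (illustrative only, not the actual polypred.py code; all names here are made up), scipy.optimize.nnls returns an exact 0 for a predictor whose unconstrained coefficient would be negative:

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    n = 1000
    prs_tag = rng.normal(size=n)                          # informative tagging PRS
    prs_fm = rng.normal(size=n)                           # "fine-mapping" PRS
    pheno = prs_tag - 0.3 * prs_fm + rng.normal(size=n)   # prs_fm points the wrong way

    # Regress the phenotype on both PRSs plus an intercept under a
    # non-negativity constraint, as in PolyPred's mixing step.
    X = np.column_stack([prs_tag, prs_fm, np.ones(n)])
    weights, _ = nnls(X, pheno)
    print(weights)  # the prs_fm weight is exactly 0.0: NNLS clips the negative coefficient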

Please let me know if you also get this behavior when working with real datasets.

Thanks,

Omer

zhilizheng commented 2 years ago

Hi Omer,

Thanks for your prompt reply. The demo data runs fine; I revised the code to run the demo data with functional annotations, and it also ran without problems. But when I try the real data, that is not the case. I don't know whether I did something wrong here; could you help me check?

# Summary
$python munge_polyfun_sumstats.py \
    --sumstats polypred_example/bolt.sumstats.txt.gz \
    --out output/bolt.sumstats.munged.parquet \
    --n 337491

# functional weights
annot="FOLDER_TO_BASELINE_LF"
$python polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix output/weights \
    --sumstats output/bolt.sumstats.munged.parquet \
    --ref-ld-chr $annot/baselineLF2.2.UKB. \
    --allow-missing \
    --w-ld-chr $annot/weights.UKB.

# Then I merge output/weights.${chr}.snpvar_ridge_constrained.gz into a single file output/weights.snpvar_ridge_constrained.gz
# I also tried the unconstrained output; it didn't work, so I use the constrained weights.
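# (A hedged sketch of the merge step above; the exact commands were not
# posted, and this assumes every per-chromosome file shares one header.)
zcat output/weights.1.snpvar_ridge_constrained.gz | head -n 1 | gzip -c \
    > output/weights.snpvar_ridge_constrained.gz
for chr in {1..22}; do
    zcat output/weights.${chr}.snpvar_ridge_constrained.gz | tail -n +2
done | gzip -c >> output/weights.snpvar_ridge_constrained.gz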

#create fine-mapping jobs
$python create_finemapper_jobs.py \
    --sumstats output/weights.snpvar_ridge_constrained.gz \
    --n 337491 \
    --method susie \
    --python3 $python \
    --max-num-causal 5 \
    --out-prefix output/polyfun_output \
    --jobs-file output/jobs.txt 

# run all the jobs
bash output/jobs.txt 

#aggregate all of the results
$python aggregate_finemapper_results.py \
    --out-prefix output/polyfun_output \
    --sumstats output/weights.snpvar_ridge_constrained.gz \
    --out output/polyfun_output.agg.txt.gz \
    --adjust-beta-freq

# tuning
$python polypred.py \
    --combine-betas \
    --betas polypred_example/bolt.betas.gz,output/polyfun_output.agg.txt.gz \
    --pheno polypred_example/1000G.pheno.txt \
    --output-prefix output/polypred \
    --plink-exe ~/bin/plink \
    polypred_example/test.*.bed

I ran this code on the demo data, and it gives a correct weight (meaning the weight for the fine-mapping component is not 0).

However, when I run it on the real data (height; summary data from linear regression in unrelated individuals; tagging effects from SBayesR), the weight for the fine-mapping component is 0. It looks really strange. I use 500 AFR samples as tuning samples, with 486 valid phenotypes for this trait. I also tried BMI, and the weight is strangely still 0.

I tried to get the weights myself: pheno ~ PRSf (18M SNPs) + PRStag (1M SNPs), using NNLS. It gives exactly the same weights as shown in the weight file produced by PolyPred (weight for the fine-mapping PRS = 0). I also tried linear regression, and it does give a negative coefficient for the fine-mapping PRS, so I think the NNLS constrained it to 0.

I don't know whether I did some of the steps wrong.

Many thanks for your help.

Regards, Zhili

omerwe commented 2 years ago

Hi @zhilizheng,

Your script looks correct. The only thing I can think of is that you might want to remove the --no-partitions flag to increase accuracy. However, I doubt this is the source of the issue.

The top concern is that the fine-mapping results are just not good enough for some reason. This is more difficult to debug, but I would start by looking at some loci with relatively established fine-mapping results (e.g. for height) and comparing your results to what's known. For example, you can look at the fine-mapping results we published in the PolyFun paper for height, and see if your results look similar to ours for some key loci. If they aren't, then it's not surprising that the NNLS decides not to use these results...

If the fine-mapping results are too non-informative to be useful, I guess you would have to dig deeper into these.

One other thing you can do is run non-functionally-informed fine-mapping first. If this solves the problem, maybe the functional annotations mess up the fine-mapping somehow...

Sorry I can't be of more help --- this is true science, you just have to dig into it :)

zhilizheng commented 2 years ago

Hi Omer,

Many thanks for your time and the great suggestions. I compared two sets of fine-mapping results: my result (at most 10 causal variants in each block) from linear regression (set 1), and the published PolyFun result (body_HEIGHTz.txt.gz, set 2).

There are 424K SNPs in common out of 479K SNPs between those two results.

  1. The correlation of the squared z-scores converted from the p-values of sets 1 and 2 is very high: r = 0.98 (see the sketch after this list for the conversion).
  2. The correlation of BETA_MEAN is moderate: r = 0.75. This could be expected; since we don't know the true causal variants, the two fine-mapping runs may have different settings.
  3. The correlation of PIP is also moderate: r = 0.56.
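A minimal sketch of the z² conversion in item 1, assuming two-sided p-values (the arrays below are made-up placeholders for the p-values of the shared SNPs):

    import numpy as np
    from scipy.stats import chi2

    # For a two-sided p-value, z^2 follows a chi-square with 1 df, so the
    # squared z-score is the chi-square inverse survival function at p.
    p1 = np.array([1e-8, 0.30, 0.04])   # placeholder p-values, set 1
    p2 = np.array([5e-9, 0.25, 0.06])   # placeholder p-values, set 2
    z2_1 = chi2.isf(p1, df=1)
    z2_2 = chi2.isf(p2, df=1)
    print(np.corrcoef(z2_1, z2_2)[0, 1])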

I don't know what to look into for more details; I'm quite new to fine-mapping. From my understanding, the two fine-mapping results may be similar.

I would appreciate it if you could give me more suggestions for the comparison.

Regards, Zhili

zhilizheng commented 2 years ago

Hi Omer,

I found that the body_HEIGHTz.PolyFun-pred.gz file from your shared PolyPred results only contains 478,840 SNPs. Did you perform some filtering on the fine-mapping results (originally 18M SNPs)? I can't find the filtering criterion; maybe I missed something here.

Thus, my result may be polluted by lots of uninformative SNPs. Filtering down to high-quality fine-mapping results might improve it.

Let me know if my guess is correct or not. Thanks.

Regards, Zhili

omerwe commented 2 years ago

Hi @zhilizheng ,

1) As to your latter question: I only list SNPs with posterior mean beta squared greater than 1e-5 IIRC (this is written in the README file in the sumstats directory, I think; see the sketch after this list). Other SNPs can be dropped because they don't contribute anything. But nothing would have changed had I kept those SNPs; it just would have made the computation longer (and required more memory).

2) Suggestion: What happens when you use PolyPred to combine results from SBayesR (using your data) and from the fine-mapping results that we published? Does the fine-mapping component get a coefficient >0 then? If that's the case then this means that (a) you could use our results; and (b) something is wrong with your own fine-mapping results.

Please note that we expect that our results would be useful for non-European populations, because they're based on fine-mapping.

3) The correspondence between the fine-mapping results is sensible, so you're definitely on the right track. The main issue is that fine-mapping is complex and involves lots of subtleties. You need in-sample LD, you need a large sample size (N>50K, and ideally N>100K), and you need to do very careful work (e.g. correct handling of multi-allelic SNPs). I don't know what your dataset is, but I would start with my suggestion (2) from above. If you don't have a very large dataset and/or a highly heritable trait, maybe you just don't have enough data to get very good fine-mapping results. In any case, you should be able to use our own published fine-mapping results to improve your PRS accuracy (if you're studying one of the 49 traits that we analyzed).
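A short sketch of the filter from (1), assuming the aggregated results are a tab-separated table with a BETA_MEAN column (path and column name taken from earlier in this thread):

    import pandas as pd

    df = pd.read_table('output/polyfun_output.agg.txt.gz')  # aggregated fine-mapping results
    df = df[df['BETA_MEAN'] ** 2 > 1e-5]                    # keep SNPs with a non-negligible posterior effect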

I can also suggest using BOLT-LMM (instead of linear regression) to generate sumstats, which can provide more powerful fine-mapping results.

Hopefully this helps, please let me know if I can assist with anything else!

Omer

zhilizheng commented 2 years ago

Hi Omer,

Thanks for your detailed instructions for checking. I found that the issue comes from the allele order. I plotted the two PRSs, one from the fine-mapping and one from SBayesR, and found that the PRSs point in opposite directions.

Take an example: rs9999968

My Finemap result: A1=T A2=A MAF=0.0067897 BETA_MEAN=-8.207266e-08

Finemap result online from your website: A1=A A2=T MAF=0.007929 BETA_MEAN=-4.948484e-05

I suspected I had mishandled the allele order.

I checked which step went wrong. My summary: A1=T A2=A MAF=0.00678967 BETA=0.0265166

BOLT Summary online: A1=A A2=T A1FRQ=0.992071 BETA=-0.031947

Looks like my summary is correct.

Then I looked into the PolyFun steps: *snpvar_ridge_constrained.gz has A1=T A2=A MAF=6.7897e-03 Z=1.8486. It's still correct here.

And I found a bug in the SuSiE fine-mapping step: when the alleles were flipped (relative to the LD reference), the output didn't flip the allele order back. Thus the generated PRS pointed in the opposite direction, hence the weight was negative and was constrained to 0 by the NNLS.

I fixed the issue in a pull request and tested it without problems; it gives correct weights (not 0) and improved prediction accuracy. Hope it helps.

Thanks for your time, and the other useful suggestions.

Regards, Zhili Zheng

omerwe commented 2 years ago

Hi @zhilizheng,

Great catch, and what you found makes perfect sense. Thanks for the pull request! However, I'm still not sure that I should flip the coefficient in this case. It's not clear that we should treat the summary statistics as "correct" rather than the LD reference panel.

There's a better, more general solution: flip a PRS if it's negatively correlated with the target phenotype before combining them in polypred.py. I fixed the code of polypred.py to do this now. Can you please git pull and let me know if this solves the problem? If so, I think I'll close your pull request for now.
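A minimal sketch of that safeguard (illustrative only, not the literal polypred.py code):

    import numpy as np

    def orient_prs(prs, pheno):
        """Flip a PRS that anti-correlates with the tuning phenotype, so the
        downstream non-negative least squares is free to assign it weight."""
        return -prs if np.corrcoef(prs, pheno)[0, 1] < 0 else prs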

Many thanks again,

Omer

zhilizheng commented 2 years ago

Hi Omer,

Thanks for the quick fix. I tested it on a small example and found that the new code makes things better, but the prediction accuracy is not as high as with my previous fix. I checked: in my case, the majority of SNPs had the effect allele flipped, but not all of them.

I think it is not a matter of treating the allele order in the summary statistics or in the LD reference as the "correct" one. The BETA sign simply has to match its own allele order, which indicates the direction of the allele's effect.

case 1, my situation: rs1 (same allele order as the LD), rs2 (opposite)

Summary:

    rs1 A1=A A2=T beta=1
    rs2 A1=C A2=T beta=-1

LD:

    rs1 A1=A A2=T
    rs2 A1=T A2=C

Your code in susie will output:

    rs1 A1=A A2=T beta=0.5
    rs2 A1=C A2=T beta=0.5

case 2: if I switch the allele orders before input to the fine-mapping, the two SNPs are in the same order as the LD

Summary:

    rs1 A1=A A2=T beta=1
    rs2 A1=T A2=C beta=1

Your code in susie will output:

    rs1 A1=A A2=T beta=0.5
    rs2 A1=T A2=C beta=0.5

There are two ways to fix this:

  1. Keep the allele order from the LD reference, like your FINEMAP code branch (line 1030):

         df_z.loc[is_flipped, 'A1'] = self.df_sumstats_locus.loc[is_flipped, 'A2']
         df_z.loc[is_flipped, 'A2'] = self.df_sumstats_locus.loc[is_flipped, 'A1']
         df_z.loc[is_flipped, 'Z'] *= (-1)

     There the output also swaps A1 and A2 when is_flipped is true, matching the sign of the final beta. For case 1 the output will be:

         rs1 A1=A A2=T beta=0.5
         rs2 A1=T A2=C beta=0.5

  2. Keep the allele order from the user input, but change the sign of the beta after the fine-mapping, like my fix (a sketch follows this list). For case 1 the output will be:

         rs1 A1=A A2=T beta=0.5
         rs2 A1=C A2=T beta=-0.5
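A minimal sketch of fix 2 in pandas, using a toy table and made-up variable names that mirror the quoted code:

    import pandas as pd

    # Toy results table: rs2's alleles were flipped internally to match the LD panel.
    df_out = pd.DataFrame({'SNP': ['rs1', 'rs2'],
                           'A1': ['A', 'T'],
                           'A2': ['T', 'C'],
                           'BETA_MEAN': [0.5, 0.5]})
    is_flipped = pd.Series([False, True])

    # Restore the user's allele order and flip the beta sign back, so the
    # reported beta always refers to the reported A1.
    df_out.loc[is_flipped, ['A1', 'A2']] = df_out.loc[is_flipped, ['A2', 'A1']].values
    df_out.loc[is_flipped, 'BETA_MEAN'] *= -1
    print(df_out)  # rs2 now reads A1=C A2=T beta=-0.5, as in case 1 above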

I like the second fix, as users usually keep the same allele order in the two input files (the 18M-SNP summary and the tagging-effect summary), and the output also keeps its original allele order.

A PRS is tied to the effect allele. The input for the PRS is: SNP, A1, beta. If the sign of beta does not match A1, the final PRS will have a low correlation with the phenotype; the correlation could be negative or positive depending on the proportion of SNPs whose effect sign does not match their allele order (i.e., whose A1 is not the allele the sign of beta refers to).
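A toy numeric illustration of this point (made-up dosages and effects): the genotype matrix counts copies of A1 per SNP, so a beta whose sign actually refers to the other allele corrupts that SNP's contribution to the score.

    import numpy as np

    G = np.array([[0, 1],    # A1-allele dosages: 3 individuals x 2 SNPs
                  [2, 0],
                  [1, 2]])
    beta_ok = np.array([0.5, -0.5])   # signs match the listed A1 alleles
    beta_bad = np.array([0.5, 0.5])   # SNP 2's sign actually refers to its A2

    print(G @ beta_ok)    # correct PRS
    print(G @ beta_bad)   # SNP 2 contributes with the wrong sign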

Hope this helps.

Regards, Zhili

omerwe commented 2 years ago

Thanks @zhilizheng, great analysis! You've convinced me, and I've merged your pull request. Many thanks!