omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)

The polypred linear combination of polyfun and bolt-lmm results in poorer prediction performance than using bolt-lmm alone #195

Closed Y-Isaac closed 5 months ago

Y-Isaac commented 6 months ago

Hi,

I successfully ran polypred without any errors, and the software also produced the mixing weights. One of the results is as follows:

polyfun 0.4593921186807445
bolt-lmm    0.8357840401238336
intercept   -0.24858259149910983

Subsequently, I applied the betas obtained from polypred to the UKB non-British European, East Asian, South Asian, and African populations (as in your manuscript). But what frustrates me is that the predictive performance in almost every population is worse than before (compared to simply using bolt-lmm). For example, in the non-British European population, the R-squared obtained using only bolt-lmm was 0.06181, but with the polypred combination it became 0.04259. I don't quite understand this result, and it makes me wonder whether an unknown error occurred when running polyfun earlier. Do you have any opinions or suggestions on this?

Thanks for your help in advance!

Best regards, Wentao

omerwe commented 6 months ago

@Y-Isaac, there are no promises when performing predictions, but statistically the predictions should not get worse (if the PolyFun component indeed doesn't contribute anything, its mixing coefficient should be set to approximately zero). It all boils down to the identity and size of the target population used to estimate the mixing weights. If it consists of individuals from the exact same population on which you later estimate the accuracy (but different individuals), I would expect prediction performance to improve, or at least not deteriorate. (A rough sketch of what the mixing-weight estimation amounts to is below the suggestions.)

Some things you could try out:

  1. Use the very same individuals to estimate the mixing weights and to estimate accuracy. This is cheating of course, but it would be a good sanity test. At least in that scenario I would expect prediction accuracy to improve over BOLT-LMM.

  2. Increase/decrease the size of the sample used to estimate the mixing weights, and see how this affects your accuracy on the test sample.
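
To make this concrete, estimating the mixing weights essentially amounts to regressing the phenotype on the two PRSs in the target sample. A simplified sketch of the idea in Python (ordinary least squares with placeholder column names; the actual estimator in polypred.py may differ, e.g. by constraining the weights to be non-negative):

import numpy as np
import pandas as pd

# toy data: one row per target-sample individual, with their two PRSs and phenotype
# (column names are placeholders, not the real file format)
rng = np.random.default_rng(0)
df = pd.DataFrame({'prs_polyfun': rng.normal(size=500),
                   'prs_bolt': rng.normal(size=500)})
df['pheno'] = 0.4 * df['prs_polyfun'] + 0.8 * df['prs_bolt'] + rng.normal(size=500)

# least-squares fit of pheno ~ w_polyfun * prs_polyfun + w_bolt * prs_bolt + intercept
X = np.column_stack([df['prs_polyfun'], df['prs_bolt'], np.ones(len(df))])
w, _, _, _ = np.linalg.lstsq(X, df['pheno'].values, rcond=None)
print('mixing weights (polyfun, bolt-lmm, intercept):', w)

# in-sample R2 of the combined score
pred = X @ w
resid = df['pheno'].values - pred
r2 = 1 - (resid ** 2).sum() / ((df['pheno'] - df['pheno'].mean()) ** 2).sum()
print('in-sample R2:', round(r2, 3))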

Hope that's clear; please let me know if not, and let me know how it goes if it is.

Y-Isaac commented 6 months ago


Hi, following your advice, I conducted a comprehensive test of this issue. I tried training-set sizes of 200, 500, and 1000, using a different seed for sampling each time, and measured the predictive performance (R-squared) of bolt-lmm alone, polyfun alone, and the polypred combination. Unfortunately, combining the methods almost never gave an additional improvement over using them separately: the combined R-squared was often between those of the two methods, and sometimes even worse. To give you a better picture, the weights and R-squared results from the three training runs are shown in the screenshot below.

[screenshot: mixing weights and R-squared results for the three training-set sizes]

Additionally, I tried applying the weights estimated from the training set of 1000 back to the same group of 1000 people. Strangely, even with this approach, which is close to "cheating," I still did not get good results, which is very puzzling to me. The R-squared data from this attempt is shown below.

[screenshot: R-squared results when estimating and evaluating on the same 1000 individuals]

I rechecked the ancestry classification of the individuals and I am confident that the problem does not stem from there; I also tried adjusting the effect alleles in the bolt-lmm and polyfun outputs, which did not change the results.

I wonder what other possible reasons could cause this? Thanks for your help in advance!

Best regards, Issac

omerwe commented 6 months ago

@Y-Isaac can you please try two things:

  1. Please git pull the code and rerun the code that estimates the mixing weights. I modified it so that it prints the in-sample R2 at the moment of fitting, i.e.:

    [INFO]  In-sample R2: 0.011

    Can you please check if these R2 values correspond to the ones you're getting when fitting PolyFun and BOLT-LMM?

  2. Separately, can you please rerun these with the flag --allow-neg-mixweights? I wonder if you'll see positive results then (though I'm not sure why this would happen).

Y-Isaac commented 6 months ago


Hi, thanks for your help!

1. The server I am using cannot connect to the internet, so I am unable to use git pull. I downloaded the polypred.py script directly from the page and uploaded it to the server, but I noticed that the script does not seem to have been updated recently. I reran the code as below:

python ./polyfun/polypred.py \
--combine-betas \
--betas ALT_hess_agg.txt.gz,ALT_pred_eur \
--pheno ./ALT/ALT_nonbritish_500.txt \
--output-prefix ./output/ALT/ALT_nonbritish \
--plink-exe ./plink/plink \
./training_data/nonbritish/ALT/chr*.bed

I don't see the in-sample R-squared value in the output, do I need to modify my command?

2. I tried adding the flag --allow-neg-mixweights, but the mixing weights did not change.

Thanks for your help in advance!

Best regards, Issac

omerwe commented 6 months ago

I forgot to push, apologies! Pushed now, can you please try again?

Y-Isaac commented 6 months ago


@omerwe Hi,

It's okay, I have now successfully run the new version of the script. I recalculated the mixing weights with --combine-betas for 500 European samples, and strangely, the program reported [INFO] In-sample R2: 0.073, which is higher than when using polyfun or bolt-lmm separately, and different from the earlier test result.

Then, I used --predict to apply this beta file to the same 500 European samples and merged the resulting .prs file with the phenotype file into a single table using Excel's VLOOKUP function. I then ran a univariate linear regression on this table with R's lm() function and looked at the Multiple R-squared. To my surprise, the R2 value was 0.0176. (The PRS for polyfun and bolt-lmm were also obtained with --predict.)

I'm not sure which step caused the inconsistency in the R2 results. Do you have any advice? I sincerely hope my question does not cause you too much trouble! All the code I used is below. If you are willing to test my results yourself, I can also send the data to your email.

python ~/software/polyfun/polypred.py \
    --combine-betas \
    --betas ./betas/ALT_hess_agg_filter.txt.gz,./betas/ALT_pred_eur \
    --pheno ./training_data/pheno_ALT_nonbritish_500.txt \
    --output-prefix ./output/ALT_nonbritish \
    --plink-exe ~/software/plink/plink \
    ./training_data/chr*.bed

python ~/software/polyfun/polypred.py \
  --predict \
  --betas ./output/ALT_nonbritish.betas \
  --output-prefix ./output/ALT_nonbritish \
  --plink-exe ~/software/plink/plink \
  ./training_data/chr*.bed

  ## use VLOOKUP to combine the pheno data and the prs data

  ## R (I also calculated R2 in Python; the result is the same)
  fit1 <- lm(ALT_int ~ score.combine, data = data1)
  summary(fit1)
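
For reference, the check amounts to roughly the following in Python (a sketch; file paths and column names are guesses based on my own files, not a fixed format):

import pandas as pd

# pandas equivalent of the VLOOKUP step: merge the .prs output with the phenotype
# file on IID, then compute the univariate R2 of the combined score
df_prs = pd.read_csv('./output/ALT_nonbritish.prs', sep=r'\s+')
df_pheno = pd.read_csv('./training_data/pheno_ALT_nonbritish_500.txt', sep=r'\s+')
data1 = df_prs.merge(df_pheno, on='IID')

# for a univariate regression with an intercept, R2 equals the squared Pearson
# correlation, matching summary(lm(...))$r.squared in R
r2 = data1['score.combine'].corr(data1['ALT_int']) ** 2
print('R2:', r2)
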
omerwe commented 6 months ago

@Y-Isaac if you can send the anonymized PRS predictions from BOLT-LMM and PolyFun to omer.we@gmail.com (or share them via a shared drive), I'm happy to take a look and try to figure this out.

Y-Isaac commented 6 months ago


Hi, I have sent you an email, please check it!

Y-Isaac commented 6 months ago

@omerwe Hi, sorry to bother you again; I wanted to ask whether you have any new ideas on this?

omerwe commented 6 months ago

@Y-Isaac apologies for the delay. I was able to reproduce the problem and I traced it down to an unexpected behavior of plink. It turns out that the PRS computed by plink depends on the number of SNPs in the input in some unexpected way. Specifically, I verified that if I include additional SNPs with effect size zero in Plink's input file, it outputs a different (inaccurate) PRS estimate than if I omit these SNPs. At the moment I have no idea why this happens...

Unfortunately I have limited bandwidth to work on this. My advice for now is to manually compute the weighted PRS by (1) computing the PRS with BOLT-LMM and with PolyFun separately, and (2) linearly combining them using the mixing weights reported in the mixing weights file (which is created when invoking --combine-betas). I verified that the mixing weights file is accurate; the only problem is with plink.
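
Concretely, the manual combination amounts to something like this (an illustrative sketch; file and column names are placeholders, not a fixed format, and the weight values are the ones from your earlier mixweights output):

import pandas as pd

# PRS computed separately for each method (e.g. two separate polypred.py --predict runs);
# file and column names here are placeholders
prs_polyfun = pd.read_csv('polyfun.prs', sep=r'\s+').set_index('IID')['PRS']
prs_bolt = pd.read_csv('bolt.prs', sep=r'\s+').set_index('IID')['PRS']

# mixing weights as reported in the mixweights file written by --combine-betas
w_polyfun, w_bolt, intercept = 0.459, 0.836, -0.249

# weighted PRS; the intercept shifts all scores equally, so it doesn't affect R2
prs_combined = w_polyfun * prs_polyfun + w_bolt * prs_bolt + intercept
prs_combined.rename('PRS').to_csv('combined.prs', sep='\t')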

Unless I find why this happens, I will simply change the behavior of polypred.py to compute a PRS using each method separately, and then linearly combine the PRSs with the mixing weights, instead of linearly combining the SNP effect sizes. This will be slower but will solve the problem for sure (this is in fact what I did while working on the PolyPred paper).

I have no idea why plink's behavior depends on the number of SNPs in the input files, this is unexpected and surprising. If anyone reading this has any idea, I'm happy for suggestions.

Y-Isaac commented 6 months ago

@omerwe Thanks for the answer, it is really helpful! It is indeed an unexpected result from plink. Sorry, but I have three questions for you:

1. You mentioned that plink's PRS calculation can vary unpredictably with the number of SNPs in the input. But I'm curious: why is only polypred affected, considering it contains a similar number of variants to polyfun? (I understand this involves plink and might be hard to trace, so if you don't have a clear idea about it at the moment, feel free to skip!)

2. You said the current workaround is to calculate the PRS separately with each method and then combine them using the mixing weights from --combine-betas. Do you plan to implement this in polypred.py?

3. In #80 you mentioned, "I only list SNPs with posterior mean beta squared greater than 1e-5 and other SNPs can be dropped because they don't contribute anything." I'm wondering whether removing low-effect or zero-effect SNPs from the polyfun output file might help? I want to test it. (Just to confirm: by keeping only variants with beta squared greater than 1e-5, you mean removing variants with |beta| below roughly 3.2e-3, i.e. sqrt(1e-5), as I understand it? A quick sketch of my understanding is below.)
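
In code, I understand the filter as something like this (a sketch; the BETA_MEAN column name and the tab separator are my guesses about the polyfun output format):

import pandas as pd

# keep only SNPs whose squared posterior mean effect exceeds 1e-5,
# i.e. |beta| > sqrt(1e-5) ≈ 3.2e-3
df = pd.read_csv('ALT_hess_agg.txt.gz', sep='\t')
df_filtered = df[df['BETA_MEAN'] ** 2 > 1e-5]
df_filtered.to_csv('ALT_hess_agg_filter.txt.gz', sep='\t', index=False)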

I'm deeply grateful for all your assistance throughout! Hope everything is well with you!

omerwe commented 6 months ago

@Y-Isaac good questions, let me try to answer!

  1. The weird plink behavior affects both BOLT-LMM and PolyFun. If I create effect sizes given by x*BOLT_beta + y*POLYFUN_beta (where BOLT_beta = 0 for each SNP not included in the BOLT-LMM analysis), I expect the PRS of each individual to be x*PRS_BOLT + y*PRS_POLYFUN (where PRS_BOLT and PRS_POLYFUN are the PRS computed from the BOLT and PolyFun effect sizes, respectively), but this simply isn't the case (see the sketch after this list). I believe this is related to the fact that PolyFun includes many more SNPs with non-zero effect sizes than BOLT. If anything, I believe that the "problem" is with BOLT, but I'm really at a loss as to why plink behaves the way it does... I'm honestly shocked by this, but I just don't have the bandwidth to investigate at length.

  2. Yes, this should be simple to implement. The PRS calculation will just be slower, but other than that it's a clean solution. Hopefully I'll get to it this week.

  3. I wouldn't try "hacking" the PRS in this way. But yes, your understanding of my filtering is correct.
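
For reference, the identity in point 1 is a simple linearity fact when the score is computed as a plain genotype-by-beta dot product, which is why plink's behavior is so surprising. A toy check of the naive computation (this is not what plink does internally, just the expected math):

import numpy as np

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(100, 1000)).astype(float)  # 100 individuals x 1000 SNPs, 0/1/2 genotypes
bolt_beta = np.where(rng.random(1000) < 0.05, rng.normal(size=1000), 0.0)  # zero for SNPs not in the BOLT-LMM analysis
polyfun_beta = 0.01 * rng.normal(size=1000)                                # non-zero for (almost) all SNPs
x, y = 0.84, 0.46                                                          # mixing weights

prs_from_combined_betas = G @ (x * bolt_beta + y * polyfun_beta)
prs_from_combined_prs = x * (G @ bolt_beta) + y * (G @ polyfun_beta)
print(np.allclose(prs_from_combined_betas, prs_from_combined_prs))  # True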

Y-Isaac commented 6 months ago

@omerwe Thanks for answering! I tested the results after reducing the number of SNPs in the polyfun output (using 1e-5 as the threshold), and there is also a significant difference. So, as you said, it really is necessary to estimate the PRS separately and then combine them linearly. Thanks for your effort!

One additional small question to confirm: in your test, the .mixweights file is accurate, and the .betas file is inaccurate, right? (Both are generated by --combine-betas.) This means that, for now, we cannot obtain a conventional per-variant weight file from polypred for others to use for this PRS? (This question is only so that I understand the issue accurately.)

I am not sure whether I should close this issue now; please decide.

omerwe commented 5 months ago

@Y-Isaac correct, the mixweights file is perfectly fine, but we cannot obtain a direct weighted-variants file because of the weird behavior of plink...

I've just updated polypred.py as discussed (computing a weighted PRS using a linear combination of PRS instead of a linear combination of SNP effect sizes). I've tested it on the data you've sent me and it seems to work well. Can you please git pull and let me know how it goes?

Please note that the updated mechanism is slightly different, and I've updated the Wiki accordingly. There are basically two changes:

  1. I renamed the flag --combine-betas to --estimate-mixweights.
  2. When calling polypred.py with the --predict flag you need to give it:
    • The exact same parameter --betas as in the previous step (i.e., a comma-separated list of files)
    • A new parameter called --mixweights-prefix, which simply points to the prefix of the mixing weights file created in the previous step (i.e., this parameter must be the same as the --output-prefix parameter in the previous call to polypred.py, the one with the --estimate-mixweights flag, formerly --combine-betas)

I'm curious to hear if this resolves your issues.

Y-Isaac commented 5 months ago

@omerwe HI,

I have tested the new script and it performs very well. Compared to using polyfun or bolt-lmm alone, polypred improves the predictive performance in most populations, or at least does not decrease it significantly, which is in line with our expectations.

A few messages appeared while running it, and I think it is worth reporting them to you:

1. When running polypred.py --predict, multiple PerformanceWarnings often occur. Based on my guess, they should not affect the PRS results. The specific content is as follows:

polypred.py:255: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

2. At the end of polypred.py --predict, a ValueError appeared, but the .prs file was still generated successfully. I am not sure whether this error actually affected the results. Here are the specific details:

polypred.py:357: FutureWarning: The squeeze argument has been deprecated and will be removed in a future version. Append .squeeze("columns") to the call to squeeze.
  s_mixweights = pd.read_csv(mixweights_file, delim_whitespace=True, squeeze=True)
Traceback (most recent call last):
  File "polypred.py", line 506, in <module>
    compute_prs(args)
  File "polypred.py", line 372, in compute_prs
    df_prs_sum_jk = pd.DataFrame(index=df_prs_all.index, columns=set_jk_columns)
  File "anaconda3/envs/polyfun/lib/python3.8/site-packages/pandas/core/frame.py", line 640, in __init__
    raise ValueError("columns cannot be a set")
ValueError: columns cannot be a set
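
For what it's worth, the ValueError itself looks like newer pandas simply rejecting a set for the columns argument; a minimal reproduction outside polyfun gives the same message (passing a list instead works):

import pandas as pd

try:
    pd.DataFrame(index=range(3), columns={'jk1', 'jk2'})  # a set -> raises in newer pandas
except ValueError as e:
    print(e)  # columns cannot be a set

print(pd.DataFrame(index=range(3), columns=sorted({'jk1', 'jk2'})))  # a list works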

Thank you for your efforts!

omerwe commented 5 months ago

@Y-Isaac thanks for confirming! All these errors and warnings are due to newer pandas versions not being fully backwards-compatible. I just git pushed a newer version that should take care of all of them; can you please git pull and confirm that the code now runs without errors or warnings? (Thankfully the code crashed after computing the PRS, so the only thing that wasn't computed was the jackknife estimates of the PRS.)

Y-Isaac commented 5 months ago

@omerwe Everything is performing perfectly now, and I have no further issues here. Thank you!

omerwe commented 5 months ago

@Y-Isaac thanks for confirming! Can we close the issue?

Y-Isaac commented 5 months ago

@omerwe Sure, I am going to close it now.