omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License
85 stars 21 forks source link

Error with aggregate_finemapper_results.py #113

Closed mkoromina closed 2 years ago

mkoromina commented 2 years ago

HI @omerwe,

I have been trying to merge the results from my genome-wide fine-mapping results but I stumbled upon the following issue. I ran:

python /path/to/polyfun/aggregate_finemapper_results.py \
    --out-prefix  /path/to/polypred/results_gwa_finemap/polyfun_output \
    --sumstats /path/to/files/stats_munged.parquet \
    --out /path/to/polypred/polyfun_output_new.agg.txt.gz \
    --adjust-beta-freq

However, this is the full error message that I get:

2745it [01:42, 26.81it/s]
Traceback (most recent call last):
  File "/path/to/polyfun/env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAF'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/polyfun/aggregate_finemapper_results.py", line 113, in <module>
    main(args)
  File "/path/to/polyfun/aggregate_finemapper_results.py", line 77, in main
    df_sumstats['BETA_MEAN'] /= np.sqrt(2*df_sumstats['MAF']*(1-df_sumstats['MAF']))
  File "/path/to/polyfun/env/lib/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/path/to/polyfun/env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'MAF'

Do you know what could be potential wrong? Many thanks once again!

omerwe commented 2 years ago

Hi @mkoromina, looks like the results are missing a MAF column. Did you run the fine-mapping results on a file that was processed by munge_sumstats.py?

The simplest direct solution is to omit the --adjust-beta-freq flag from the invocation of aggregate_finemapper_results.py, but this would mean that the aggregated results would represent per-normalized-allele effects instead of per-allele effects. If you want per-allele effects, you can take the aggregated results file and divide BETA_MEAN and BETA_SD of each SNP by sqrt(2*f*(1-f), where f is the MAF of that SNP...

If you have the computational capacity, maybe it's better to rerun the munge_sumstats.py script and make sure that it gets a MAF column as input. But then you'll have to rerun all your fine-mapping jobs...

Sorry there's no simpler solution, I hope you can find an easy way to solve this! Please let me know if you'd like me to explain more!

mkoromina commented 2 years ago

Hi @omerwe,

Many thanks for this! I used the munge_sumstats.pyscript to munge my sumstats, however, I used the ridge_constrained files to run the genome-wide fine-mapping jobs. Could it be because of that? Moreover, I ommited the--adjust-beta-freq flag and although the results' aggregation ran with no errors, this led to all betas equaling with 0 after running step 3 (i.e., linearly combining the effect sizes).

Do you know why could this be? Could it be an issue with the headers of the files (i.e variable names)? Many thanks once again!

omerwe commented 2 years ago

@mkoromina your approach seems perfectly fine. The question is why the MAF column is missing. I guess the munge_sumstats.py script couldn't find it? Did you run it with --min-maf 0? Otherwise it should have complained...

all betas=0 likely mean that the phenotype in the small training set from the target population is very very different from the phenotype in the training sets used for running the other methods...

How accurately can you predict if you take the original betas (from the other method and/or from PolyFun) and try to predict the phenotypes of the small training set of the target population? If I'm right, you shouldn't be able to predict them at all... If you can't, then that's the source of the problem. If not, then we'll need to dig deeper :)

mkoromina commented 2 years ago

Hi @omerwe,

Many thanks once again! Truth is that there is not a column named as 'MAF', since the sumstats file follows the standard ricopili format and the names of some of the columns differ. I checked the .logfile and there is a warning about the MAF so I guess this answers why there is no MAF column in the output.

In terms of all betas=0 (hence all PRS=0), I think what you suggested makes perfect sense. Is there though another case example that I could use and apply the pipeline?

Thanking you very much in advance!

omerwe commented 2 years ago

@mkoromina you can check the example provided with PolyFun (as explained in the wiki page). I'm not sure if that's what you meant?

mkoromina commented 2 years ago

Hi @omerwe , I will send you an email (if that's okay) to better outline that! Thanks a lot anyways!