omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Key Error 'IID'- PolyPred error #111

Closed mkoromina closed 2 years ago

mkoromina commented 2 years ago

Hi @omerwe,

Thanks for providing us with such a useful tool! Coming back to step 3 of the PolyPred pipeline: I am testing my data using (a) effect sizes from another method (not BOLT-LMM) and (b) effect sizes from PolyFun. I am also using (c) a bed file and a pheno file from a small subset of the testing cohort that is not included in (a). Note that (a) comes from another method, which includes all individuals except those in the small subset (c). PolyFun was run for all assessed individuals (b).

The code I am running is: python /path/to/polypred.py --combine-betas --betas /path/to/other_method.tsv.gz,/path/to/polyfun.txt.gz --pheno /path/to/mypheno.fam --output-prefix /path/to/results/combine_effects --plink-exe /path/to/plink /path/to/my_subset.bed

The full log message and error that I receive is:

Any indication of how I should fix this, or whether it could be an issue with my input data, is more than welcome! Many thanks once again, Maria

omerwe commented 2 years ago

Hi @mkoromina, most likely your phenotype file does not have a column with a header called IID. Please look at what the header line should look like in the example files (as described in the wiki).

Hope this helps, please let me know if not!

mkoromina commented 2 years ago

Hi @omerwe ,

Thanks for the quick reply! I did check the headers, and a column named 'IID' is present. The columns in my pheno file (I edited the .fam file to include headers) are FID, IID, a few others relevant to .fam files, and a 'PHENO' column.

Do you have any other potential solutions/suggestions on what could be wrong? Many thanks in advance!

omerwe commented 2 years ago

@mkoromina can you post the header of your phenotypes file (or send it to me at oweissbrod@hsph.harvard.edu)? If you can share, you can run head <pheno_file> and post/send the output.

Another possibility is to physically copy-paste the header line of the example file provided with PolyFun into your own phenotypes file. Maybe you have some extra (possibly hidden) characters in your header line that mess things up?
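One way to check for such hidden characters is to look at the raw bytes of the header line. The sketch below is only an illustration and is not part of PolyFun: the function name `inspect_header` is hypothetical, and it only looks for a UTF-8 byte-order mark, a Windows line ending, non-printable or non-ASCII bytes, and whether a column named exactly IID is present.

```python
def inspect_header(path, expected="IID"):
    """Report common hidden-character problems in the first line of a file.

    Hypothetical helper for illustration only; not part of PolyFun.
    Returns (columns, problems), where problems is a list of messages.
    """
    # Read raw bytes so that nothing is silently decoded or stripped.
    with open(path, "rb") as f:
        raw = f.readline()

    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 byte-order mark at the start of the file")
        raw = raw[3:]  # drop the mark so the remaining checks see the text only
    if raw.endswith(b"\r\n"):
        problems.append("Windows (CRLF) line ending")

    body = raw.rstrip(b"\r\n")
    # Anything other than a tab or printable ASCII is suspicious in a header.
    odd = sorted({b for b in body if b != 0x09 and not 0x20 <= b <= 0x7E})
    if odd:
        problems.append("unexpected bytes: " + " ".join(hex(b) for b in odd))

    # Split on whitespace and look for an exact match of the expected name.
    columns = body.decode("utf-8", errors="replace").split()
    if expected not in columns:
        problems.append("no column named exactly %r (found %r)" % (expected, columns))
    return columns, problems
```

For example, `columns, problems = inspect_header("/path/to/mypheno.fam")` returns an empty `problems` list when the header is clean. A clean result here does not guarantee that the file is delimited the way the tool expects, so comparing against the example file is still worthwhile.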

mkoromina commented 2 years ago

Hi @omerwe,

Many thanks for your recommendation! I will try your suggestions and come back to you if the issue persists (it must be a hidden character in the header line). Thanks once again for all the useful tips!

mkoromina commented 2 years ago

Hi @omerwe,

Really sorry to re-open this. I fixed the header of the pheno file, and upon re-running the above-mentioned command I get the following error message:

Traceback (most recent call last):
  File "/path/to/polyfun/polypred.py", line 434, in <module>
    estimate_mixing_weights(args)
  File "/path/to/polyfun/polypred.py", line 295, in estimate_mixing_weights
    df_prs_sum = computs_prs_all_files(args, betas_file, disable_jackknife=True, keep_file=args.pheno)
  File "/path/to/polyfun/polypred.py", line 239, in computs_prs_all_files
    keep_file=keep_file
  File "/path/to/polyfun/polypred.py", line 83, in compute_prs_for_file
    raise ValueError('No betas found for SNPs in plink file %s'%(plink_file_prefix))
ValueError: No betas found for SNPs in plink file /path/to/cohort1.bed

Do you know what could be wrong in this instance? Many thanks!

P.S. Just to restate what I am using: (a) effect sizes from another method (not BOLT-LMM) and (b) effect sizes from PolyFun. I am also using (c) a bed file and a pheno file from a small subset of the testing cohort that is not included in (a). Note that (a) comes from another method, which includes all individuals except those in the small subset (c). PolyFun was run for all assessed individuals (b).

omerwe commented 2 years ago

Hi @mkoromina, The code can't find any sumstats for the SNPs in your bim file. Are you sure they have the same chromosome and allele encodings?

If you want, please post a few lines from your .bim file and from your sumstats file that you think should match the same SNPs, and we'll try to figure out why the code thinks they're different SNPs. (please note that the code doesn't use rsids to identify SNPs because they're not unique; it uses SNP positions and alleles)
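To illustrate matching by position and alleles, here is a rough sketch that counts how many variants in a .bim file have a counterpart in a sumstats file. It is not PolyFun's own matching code: the helper names are hypothetical, the sumstats file is assumed to be whitespace-delimited (optionally gzipped) with header columns named CHR, BP, A1 and A2, and alleles are compared in either order without any strand flipping.

```python
import gzip


def _open_text(path):
    # Transparently handle gzipped and plain-text files.
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def _key(chrom, bp, a1, a2):
    # Normalise encodings: "chr1" and "1" are treated alike, alleles are
    # upper-cased, and allele order is ignored.
    chrom = chrom.lower()
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(bp), frozenset((a1.upper(), a2.upper())))


def bim_keys(bim_path):
    # PLINK .bim has no header; its columns are:
    # chromosome, variant ID, cM position, bp position, allele 1, allele 2.
    keys = set()
    with _open_text(bim_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 6:
                keys.add(_key(fields[0], fields[3], fields[4], fields[5]))
    return keys


def sumstats_keys(sumstats_path):
    # ASSUMPTION: header columns are named CHR, BP, A1, A2.
    keys = set()
    with _open_text(sumstats_path) as f:
        header = f.readline().split()
        # Raises ValueError if one of the assumed columns is missing.
        idx = [header.index(c) for c in ("CHR", "BP", "A1", "A2")]
        for line in f:
            fields = line.split()
            if len(fields) > max(idx):
                keys.add(_key(*(fields[i] for i in idx)))
    return keys


def count_overlap(bim_path, sumstats_path):
    # Returns (variants in bim, variants in sumstats, variants in both).
    bim = bim_keys(bim_path)
    ss = sumstats_keys(sumstats_path)
    return len(bim), len(ss), len(bim & ss)
```

If the overlap is zero, printing a few keys from each set side by side may show whether the chromosome labels, the coordinates (for example, a different genome build) or the allele codes are what differs.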

mkoromina commented 2 years ago

Hi @omerwe,

Sure, I am attaching below some lines corresponding to certain SNPs from my sumstats file (from the 'other method') and the respective info for these from the .bim file.

- sumstats file: sumstats_file_chr1

- bim file: bim_file_chr1

If there is any extra information that is needed for trouble-shooting, just let me know. Many thanks!

omerwe commented 2 years ago

@mkoromina unfortunately I can't easily figure out what's the source of the problem. If you want, you can send a small reproducible example to oweissbrod@hsph.harvard.edu and I'll try to figure it out...

mkoromina commented 2 years ago

Hi @omerwe ,

I think there may be something off with the .bim file, i.e., the data not being parsed properly. I can definitely try to create a small example and send it to you!

Many thanks!!