omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License
89 stars 22 forks source link

Error when running polyfun with multiple custom annotations #104

Closed freg1086 closed 2 years ago

freg1086 commented 2 years ago

Hi Omer, I have a problem when using multiple custom annotation files.

So I try to compute polyfun priors for a sumstats file using the baselineLF2.2 combined with my custom annotations. Each custom annotation is a separate file, i.e., I have each annotation in a a separate folder with annot.gz, l2.ldscore.gz and l2.M files for every chromosome (1-22). A total of 14 custom annotations + 1 baseline.

Here is an extract from my launch code :

refld="baseline/baselineLF2.2.UKB." refld2="annotA/annotA.,annotB/annotB.,annotC/annotC. # and so on

python polyfun/polyfun.py \
    --compute-h2-L2 \
    --output-prefix $OUTFILE \
    --sumstats $SUMSTATS \
    --ref-ld-chr $refld, $refld2 \
    --w-ld-chr baselineLF2.2.UKB/weights.UKB. \
    --allow-missing

Which gives the following output :

`[INFO]  Reading summary statistics from sumstats.munged.parquet ...
[INFO]  Read summary statistics for 15142964 SNPs.
[INFO]  Reading reference panel LD Score from baseline/baselineLF2.2.UKB.,annotA/annotA.,annotB/annotB.,annotC/annotC.[1-22]...

 82%|███████▋  | 17/22 [03:25<00:37,  7.59s/it]
100%|██████████| 22/22 [03:48<00:00, 10.39s/it]
 82%|███████▋  | 17/22 [01:19<00:15,  3.14s/it]
100%|██████████| 22/22 [01:29<00:00,  4.06s/it]
 82%|███████▋  | 17/22 [01:19<00:15,  3.11s/it]
100%|██████████| 22/22 [01:29<00:00,  4.05s/it]
 82%|███████▋  | 17/22 [01:16<00:15,  3.03s/it]
100%|██████████| 22/22 [01:26<00:00,  3.93s/it]
 82%|███████▋  | 17/22 [01:18<00:17,  3.43s/it]
100%|██████████| 22/22 [01:27<00:00,  3.99s/it]
 82%|███████▋  | 17/22 [01:16<00:15,  3.06s/it]
100%|██████████| 22/22 [01:26<00:00,  3.93s/it]
 82%|███████▋  | 17/22 [01:18<00:15,  3.17s/it]
100%|██████████| 22/22 [01:28<00:00,  4.02s/it]
 82%|███████▋  | 17/22 [01:18<00:15,  3.06s/it]
100%|██████████| 22/22 [01:28<00:00,  4.01s/it]
 82%|███████▋  | 17/22 [01:17<00:15,  3.08s/it]
100%|██████████| 22/22 [01:26<00:00,  3.94s/it]
 82%|███████▋  | 17/22 [01:18<00:15,  3.14s/it]
100%|██████████| 22/22 [01:27<00:00,  4.00s/it]
 82%|███████▋  | 17/22 [01:17<00:15,  3.06s/it]
100%|██████████| 22/22 [01:27<00:00,  3.98s/it]
 82%|███████▋  | 17/22 [01:18<00:15,  3.16s/it]
100%|██████████| 22/22 [01:28<00:00,  4.00s/it]
 82%|███████▋  | 17/22 [01:17<00:15,  3.05s/it]
100%|██████████| 22/22 [01:27<00:00,  3.97s/it]
 82%|███████▋  | 17/22 [01:16<00:15,  3.15s/it]
100%|██████████| 22/22 [01:26<00:00,  3.94s/it]
 82%|███████▋  | 17/22 [01:19<00:15,  3.13s/it]
100%|██████████| 22/22 [01:29<00:00,  4.05s/it]
 82%|███████▋  | 17/22 [02:10<00:25,  5.12s/it]
100%|██████████| 22/22 [02:26<00:00,  6.67s/it]
[INFO]  Read reference panel LD Scores for 19275186 SNPs.
[INFO]  Reading regression weight LD Score from baselineLF2.2.UKB/weights.UKB.[1-22] ...
 82%|███████▋  | 17/22 [01:04<00:12,  2.58s/it]
100%|██████████| 22/22 [01:12<00:00,  3.31s/it]
[INFO]  Read regression weight LD Scores for 18275613 SNPs.
[INFO]  After merging with reference panel LD, 15102863 SNPs remain.
[INFO]  After merging with regression SNP LD, 15073982 SNPs remain.
[INFO]  Removed 3807 SNPs with chi^2 > 685.885 (15070175 SNPs remain)
[INFO]  iterating over chromosomes to compute XTX, XTy...
 82%|███████▋  | 17/22 [02:20<00:23,  4.80s/it]
100%|██████████| 22/22 [02:33<00:00,  6.98s/it]
[INFO]  Evaluating Ridge lambdas...
 19%|█▉        | 19/100 [00:11<00:39,  2.04it/s]
 38%|███▊      | 37/100 [00:19<00:31,  2.02it/s]
 55%|█████▌    | 55/100 [00:30<00:23,  2.00it/s]
 71%|███████   | 71/100 [00:38<00:14,  2.05it/s]
 87%|████████▌ | 86/100 [00:45<00:06,  2.01it/s]
100%|██████████| 100/100 [00:54<00:00,  1.82it/s]
[INFO]  Selected ridge lambda: 9.3797e-02 (70/100)  score: 4.2854e-02  score lstsq: 4.1203e-02
100%|██████████| 2/2 [00:53<00:00, 26.85s/it]
[INFO]  Estimating annotation coefficients for each chromosomes set
[INFO]  Computing per-SNP h^2 for each chromosome...
  0%|          | 0/22 [00:46<?, ?it/s]
Traceback (most recent call last):
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 848, in <module>
    polyfun_obj.polyfun_main(args)
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 772, in polyfun_main
    self.polyfun_h2_L2(args)
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 597, in polyfun_h2_L2
    self.compute_snpvar(args, use_ridge=True)
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 352, in compute_snpvar
    df_snpvar_chr = self.compute_snpvar_chr(args, chr_num, use_ridge=use_ridge)
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 301, in compute_snpvar_chr
    df_annot_chr = self.load_annotations_file(args, chr_num, use_ridge)
  File "/project/gazal_569/artem/bin/polyfun/polyfun.py", line 287, in load_annotations_file
    raise ValueError('Annotation names in annotations file do not match the one in the LD-scores file')
ValueError: Annotation names in annotations file do not match the one in the LD-scores file`

From what I understand, the error comes from line 287 of polyfun.py :

` #if we have more annotations that ref-ld, it might mean that some annotations were removed, so remove them from here as well

    if not np.all(np.isin(self.ref_ld_cnames, df_annot_chr.columns)):
        raise ValueError('Annotation names in annotations file do not match the one in the LD-scores file')
    if len(self.ref_ld_cnames) < len(df_annot_chr.columns) - len(SNP_COLUMNS):            
        df_annot_chr = df_annot_chr[SNP_COLUMNS + self.ref_ld_cnames]`

Not really sure, where the error comes from. I checked the column names and numbers and the correspondence between annot.gz and ldscore.gz files, everything seems to be ok.

Sorry for the long post, and thank you for your work.

omerwe commented 2 years ago

Hi @freg1086, to make this easier to debug I modified the code to print a clearer error message that shows exactly which columns are supposedly missing. Can you please git pull the code and rerun this? Please let me know if this helps you figure out what's wrong, or if you still can't figure it out.

freg1086 commented 2 years ago

Hi Omer, thanks for the code modifications. I reran and got this as an error message :

Traceback (most recent call last):
  File "/bin/polyfun/polyfun.py", line 849, in <module>
    polyfun_obj.polyfun_main(args)
  File "/bin/polyfun/polyfun.py", line 773, in polyfun_main
    self.polyfun_h2_L2(args)
  File "/bin/polyfun/polyfun.py", line 598, in polyfun_h2_L2
    self.compute_snpvar(args, use_ridge=True)
  File "/bin/polyfun/polyfun.py", line 353, in compute_snpvar
    df_snpvar_chr = self.compute_snpvar_chr(args, chr_num, use_ridge=use_ridge)
  File "/bin/polyfun/polyfun.py", line 302, in compute_snpvar_chr
    df_annot_chr = self.load_annotations_file(args, chr_num, use_ridge)
  File "/bin/polyfun/polyfun.py", line 287, in load_annotations_file
    missing_annot = self.ref_ld_cnames[~np.isin(self.ref_ld_cnames, df_annot_chr.columns)]
TypeError: only integer scalar arrays can be converted to a scalar index

Still having trouble figuring the problem out, do you have any ideas?

Also, my annotation names contain dashes -. Could that interfere with the Polyfun reading process ?

omerwe commented 2 years ago

Hi @freg1086,

Sorry, I didn't test my new code properly. I just fixed this error --- can you please git pull and retry?

freg1086 commented 2 years ago

Hi Omer, thanks for the corrections.

Here is the output error of my latest attempt :

raise ValueError('The following annotations have LD-scores but are not in any of the annotation files: %s'%(missing_annot)) ValueError: The following annotations have LD-scores but are not in any of the annotation files: [ 'zoonomia_mam_1_common_', 'zoonomia_mam_2_lowfreq_', 'zoonomia_mam_2_common_', 'zoonomia_mam_3_lowfreq_', 'zoonomia_ mam_3_common_', 'zoonomia_mam_4_lowfreq_', 'zoonomia_mam_4_common_', 'zoonomia_mam_5_lowfreq_', 'zoonomia_mam_5_common_', 'zoonomia_pri_1_lowfreq_', 'zoonomia_pri_1_common_', 'zoonomia_pri_2_lowfreq_', 'zoonomia_pri_2_common_', 'zoonomia_pri_3_lowfreq_', 'zoonomia_pri_3_common_', 'zoonomia_pri_4_lowfreq_', 'zoonomia_pri_4_common_', 'zoonomia_pri_5_lowfreq_', 'zoonomia_pri_5_common_']

As you can see, the issue seems to be the name of my annotation files. for some reason, Polyfun appends '_' to the end of some of my annotation names (others seem to work fine) and then proceeds to not finding them...I checked the header of these files as well as my inpt paths, nothing seems to be wrong...any ideas ?

Thank you for your help and sorry for all the trouble.

omerwe commented 2 years ago

Hi @freg1086,

Great, we're making progress. If you could create a small reproducible example and send to me (oweissbrod@hsph.harvard.edu) I'll be better able to assist.

omerwe commented 2 years ago

@freg1086 thanks for sending me the files. I think I found and fixed a bug that caused this. In a nutshell, this was caused by having >9 different annotation files, I guess you're the first to try that :)

Can you please try to git pull the latest code and see if it works now?

freg1086 commented 2 years ago

Hey Omer, it seems to work now, thank you very much for your help ! :)

omerwe commented 2 years ago

Great! Can we close the issue?

freg1086 commented 2 years ago

Thanks Omer! Yes, I'm closing this one.