functional annotations in hg38 version

cybluetree commented 4 months ago

Thank you so much for developing this wonderful tool. I'm wondering if it's possible to also provide functional annotations (~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations) in hg38 version. Thanks!

omerwe commented 4 months ago

@cybluetree I don't have the bandwidth to do this, but if someone could do the conversion (using a tool like LiftOver) I'll be happy to link to the updated annotations. Sorry I can't help more...
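
For anyone attempting the conversion, here is a minimal sketch of the bookkeeping around a liftover, not a tested pipeline. It assumes a hypothetical `lift` callable that wraps whichever tool you use (e.g. LiftOver) and returns the new `(chromosome, position)` or `None` when a position cannot be mapped. Dropping SNPs that land on a different chromosome or collide with an already-lifted SNP is an illustrative choice, and strand flips are not handled.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A SNP record: (chromosome, base-pair position, reference allele, alternative allele)
Snp = Tuple[str, int, str, str]


def lift_snps(
    snps: Iterable[Snp],
    lift: Callable[[str, int], Optional[Tuple[str, int]]],
) -> Tuple[List[Snp], List[Snp]]:
    """Apply `lift` to every SNP and return (lifted, dropped).

    `lift` is a hypothetical wrapper around a real liftover tool.
    """
    lifted: List[Snp] = []
    dropped: List[Snp] = []
    seen = set()
    for chrom, pos, ref, alt in snps:
        target = lift(chrom, pos)
        # Drop SNPs that do not map, or that move to another chromosome.
        if target is None or target[0] != chrom:
            dropped.append((chrom, pos, ref, alt))
            continue
        new_snp = (chrom, target[1], ref, alt)
        # Two old positions can map to the same new position; keep only the
        # first so the output contains no duplicate SNPs.
        if new_snp in seen:
            dropped.append((chrom, pos, ref, alt))
            continue
        seen.add(new_snp)
        lifted.append(new_snp)
    return lifted, dropped
```

Inspecting `dropped` before writing the new annotation files shows how many SNPs were lost and why.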

simonlee184 commented 6 days ago

Hi Omer, I was running polyfun but my data is also on hg38. Because the baseline-LF UKB annotations were in hg19, I annotated the rsIDs in biomaRt to get the hg38 positions, but after doing so, I received this error:

ValueError: After merging with reference panel LD, 0 SNPs remain. Please make sure that your annotation files include the SNPs in your sumstats files (please see the PolyFun wiki for details on downloading functional annotations)

I've checked that the variant positions exist in both the reference panel and my data, but for some reason there is a merging error. Would you know why this error might happen?

mkoromina commented 5 days ago

Following up on @simonlee184's message, and since we were working on lifting over the baseline annotations from hg19 to hg38, this is the full log of the error we received when trying to run the s-ldsc step from polyfun.


[INFO]  Error parsing reference panel LD Score.
Traceback (most recent call last):
  File "polyfun/polyfun.py", line 851, in <module>
    polyfun_obj.polyfun_main(args)
  File "polyfun/polyfun.py", line 774, in polyfun_main
    self.polyfun_h2_L2(args)
  File "polyfun/polyfun.py", line 596, in polyfun_h2_L2
    self.run_ldsc(args, use_ridge=True, nn=False, evenodd_split=False, keep_large=False)
  File "polyfun/polyfun.py", line 179, in run_ldsc
    M_annot, w_ld_cname, ref_ld_cnames, df_sumstats, _ = sumstats._read_ld_sumstats(args, log, args.h2)
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/sumstats.py", line 254, in _read_ld_sumstats
    ref_ld = _read_ref_ld(args, log)
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/sumstats.py", line 86, in _read_ref_ld
    ref_ld = _read_chr_split_files(args.ref_ld_chr, args.ref_ld, log,
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/sumstats.py", line 163, in _read_chr_split_files
    raise e
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/sumstats.py", line 160, in _read_chr_split_files
    out = parsefunc(_splitp(chr_arg), _N_CHR, **kwargs)
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/parse.py", line 128, in ldscore_fromlist
    y = ldscore(fh, num)
  File "/sc/arion/projects/scratch/polyfun/ldsc_polyfun/parse.py", line 230, in ldscore
    raise ValueError(error_msg)
ValueError: Duplicate SNPs were found in the input data:
Index(['1.0.43.0.C.T', '1.0.91.0.C.T', '1.0.206.0.AAT.A', '1.0.224.0.C.G',
       '1.0.278.0.A.G', '1.0.307.0.A.C', '1.0.366.0.C.T', '1.0.375.0.A.G',
       '1.0.402.0.A.G', '1.0.519.0.G.T',
       ...
       '22.0.147420.0.C.T', '22.0.148742.0.C.T', '22.0.154354.0.A.G',
       '22.0.157574.0.C.T', '22.0.165008.0.C.T', '22.0.166362.0.C.T',
       '22.0.172974.0.C.T', '22.0.174736.0.C.T', '22.0.177860.0.C.T',
       '22.0.179918.0.C.G'],

@omerwe, could it be that we need to recompute the LD scores from scratch after converting the base-pair positions in the annotations from hg19 to hg38? Kindly let us know if more info is needed to debug this.

omerwe commented 4 days ago

@simonlee184 I'm not sure I understand all the steps that you did, but it looks like there are still systematic discrepancies between the annotation files and the sumstats files. I'm afraid you'll have to see what these systematic discrepancies are...

PolyFun treats two SNPs as the same SNP only if they share (1) chromosome; (2) position; (3) reference allele; (4) alternative allele. If two SNPs differ in at least one of these, they're considered as different. One thing you need to check is if the reference and alternative allele aren't swapped between the annotations and the sumstats. If you want to share a small snapshot of what the data looks like, I can help figure it out.
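
The matching rule described above can be checked with a small diagnostic like the following sketch. It is not PolyFun's code. It assumes each SNP is available as a `(chromosome, position, ref, alt)` tuple, and the function names are made up for illustration. It counts how many sumstats SNPs match the annotations exactly, match only after swapping the two alleles, match on position only, or do not match at all.

```python
from collections import Counter


def snp_key(chrom, pos, ref, alt):
    """Normalize one SNP to a comparable (chromosome, position, ref, alt) tuple."""
    return (str(chrom), int(pos), ref.upper(), alt.upper())


def classify_overlap(sumstats, annotations):
    """Count sumstats SNPs by how they match the annotation SNPs."""
    annot_keys = {snp_key(*snp) for snp in annotations}
    annot_positions = {(chrom, pos) for chrom, pos, _, _ in annot_keys}
    counts = Counter()
    for snp in sumstats:
        chrom, pos, ref, alt = snp_key(*snp)
        if (chrom, pos, ref, alt) in annot_keys:
            counts["exact"] += 1
        elif (chrom, pos, alt, ref) in annot_keys:
            # Same site, but reference and alternative alleles are swapped.
            counts["alleles_swapped"] += 1
        elif (chrom, pos) in annot_positions:
            # Same site, but the alleles differ in some other way.
            counts["position_only"] += 1
        else:
            counts["no_match"] += 1
    return counts
```

A large `alleles_swapped` or `position_only` count would point to an allele-coding discrepancy, while a large `no_match` count would point to positions (or chromosome labels) that differ between the two files.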

omerwe commented 4 days ago

@mkoromina There are duplicate SNPs in your LDscore files... The error message also suggests that the LD-score files include positions that look like "0.43.0". I'm not sure what these positions are? Can you share a small snapshot of a few lines from your LD-score files?

mkoromina commented 3 days ago

Thanks @omerwe for the quick response. This is the header of one of the LD score files when we parse it in R. We are checking with @simonlee184 if this is down to issues from the liftover and/or issues when writing these files as ldscore.txt.gz (perhaps erroneous characters are entered upon file conversion?). If you have any additional thoughts, we'd be more than happy to further dig into it.

[Screenshot: Screen Shot 2024-11-21 at 10 25 52 AM]

Thank you!

omerwe commented 2 days ago

@mkoromina I just realized what might be going on. It looks like the code interprets the chromosome and SNP positions as floats rather than ints, so it adds ".0". I'm not sure why it happens (probably some pandas version mismatch), but I just git pushed a fix that should hopefully resolve this. Can you please git pull and let me know if it helps?
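
To illustrate the float-versus-int effect, here is a toy sketch. The `CHR.BP.A1.A2` ID layout is inferred from the IDs in the error message, and the helper names are hypothetical. This is neither PolyFun's code nor the fix that was pushed.

```python
def make_snp_id(chrom, bp, a1, a2):
    """Join the fields into an ID, formatting the numbers exactly as given."""
    return f"{chrom}.{bp}.{a1}.{a2}"


def make_snp_id_int(chrom, bp, a1, a2):
    """Same, but cast chromosome and position to int first.

    Assumes numeric chromosomes (1-22); a label such as "X" would raise here.
    """
    return f"{int(chrom)}.{int(bp)}.{a1}.{a2}"


# Float inputs reproduce the odd-looking IDs from the error message.
print(make_snp_id(1.0, 43.0, "C", "T"))      # 1.0.43.0.C.T
# Casting to int removes the spurious ".0" parts.
print(make_snp_id_int(1.0, 43.0, "C", "T"))  # 1.43.C.T
```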

mkoromina commented 2 days ago

Thanks for the quick response! We made some progress with this fix you added; thanks for this. However, we stumbled upon the following error:

█████████████████████████████████| 2/2 [00:17<00:00,  8.75s/it]
[INFO]  Computing per-SNP h^2 for each chromosome...
 18%|███████████████████████████████████████████████████▎                                | 4/22 [02:49<12:43, 42.44s/it]
Traceback (most recent call last):
  File "polyfun/polyfun.py", line 851, in <module>
    polyfun_obj.polyfun_main(args)
  File "polyfun/polyfun.py", line 774, in polyfun_main
    self.polyfun_h2_L2(args)
  File "polyfun/polyfun.py", line 599, in polyfun_h2_L2
    self.compute_snpvar(args, use_ridge=True)
  File "polyfun/polyfun.py", line 354, in compute_snpvar
    df_snpvar_chr = self.compute_snpvar_chr(args, chr_num, use_ridge=use_ridge)
  File "polyfun/polyfun.py", line 313, in compute_snpvar_chr
    raise ValueError('not all chromosomes have a taus estimate - please make sure that the intersection of SNPs with sumstats and with annotations data spans all chromosomes')
ValueError: not all chromosomes have a taus estimate - please make sure that the intersection of SNPs with sumstats and with annotations data spans all chromosomes

We checked the annotation files and the sumstats, and chromosomes are coded the same way (i.e., we made sure that there is no chr prefix in the CHR column). We also checked the overlap between SNPs in the annotations and in our GWAS: although the overlap was not complete, it ranged from 197k to 1.15 million SNPs depending on the chromosome. Do you have any ideas as to how we could debug this error?

Thanks so much for all your help ~ we are one step closer to the successful liftover and subsequent use of the hg38 baseline annotations. Many thanks!
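
A per-chromosome overlap check of the kind described above could look like this sketch. It assumes SNPs are given as `(chromosome, position, ref, alt)` tuples and that chromosomes 1 to 22 are expected. The function names are illustrative and not part of PolyFun, and the check only reports chromosomes with no exact overlap. It does not explain the error by itself.

```python
def overlap_per_chromosome(sumstats, annotations, chromosomes=range(1, 23)):
    """Count SNPs shared by sumstats and annotations, per chromosome."""
    annot_keys = {(str(c), int(p), r, a) for c, p, r, a in annotations}
    counts = {str(c): 0 for c in chromosomes}
    for c, p, r, a in sumstats:
        key = (str(c), int(p), r, a)
        # Only exact matches on chromosome, position and both alleles count.
        if key in annot_keys and key[0] in counts:
            counts[key[0]] += 1
    return counts


def chromosomes_without_overlap(counts):
    """List the chromosomes whose overlap count is zero."""
    return [chrom for chrom, n in counts.items() if n == 0]
```

Because chromosome labels are compared as strings, a label such as `chr1` or `1.0` in one file would show up here as a chromosome with zero overlap.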

omerwe commented 1 day ago

@mkoromina probably there's some component that parses the chromosomes differently in different parts of the code... I'm really not sure why, I'm afraid I'll need a small reproducible example for this. If you can share a simple example with just a few SNPs (e.g. 10-20 SNPs) from each chromosome, for both the sumstats and the ld-score files, I can try to diagnose this... Please feel free to share with my email (omer.we@gmail.com).
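
One way to cut such a small example is sketched below. It assumes each file has been read into a list of dict rows (for instance with `csv.DictReader`) and that the chromosome column is named `CHR`. Both are assumptions, so adjust them to the actual files.

```python
def head_per_chromosome(rows, n=20, chr_field="CHR"):
    """Keep the first `n` rows seen for each chromosome, preserving row order."""
    kept = []
    seen = {}
    for row in rows:
        chrom = str(row[chr_field])
        if seen.get(chrom, 0) < n:
            seen[chrom] = seen.get(chrom, 0) + 1
            kept.append(row)
    return kept
```

Applying the same function to the sumstats rows and to the LD-score rows gives two small files that can be shared. The two subsets are not guaranteed to contain the same SNPs, so it is worth checking that they overlap before sending them.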

omerwe / polyfun

functional annotations in hg38 version #199