omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License
89 stars 22 forks source link

error with ldscore files when using polyfun --compute-h2-L2 #55

Closed cmlakhan closed 3 years ago

cmlakhan commented 3 years ago

I am trying to run the polyfun --compute-h2-L2 command using your baseline UKB annotation and a set of annotations I have created myself. I used the SNPs from your baseline annotation to create my own annotation. Initially, I was getting an error because for chromosomes 6, 8, and 11 there seemed to be issues with the calculation of the LD score based on the annotations. While it created the ldscore files the log files showed these errors below:

[WARNING]  Removing 62677 SNPs from long-range LD region on chromosome 11 BP 46000000-57000000
[WARNING]  Only 6625/17164 SNPs in chromosome 11 BP 45000001-48000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Removing 62677 SNPs from long-range LD region on chromosome 11 BP 46000000-57000000
[WARNING]  Only 6625/17164 SNPs in chromosome 11 BP 45000001-48000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Removing 99700 SNPs from long-range LD region on chromosome 6 BP 25500000-33500000
[WARNING]  Only 12641/23058 SNPs in chromosome 6 BP 24000001-27000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Only 18850/24200 SNPs in chromosome 6 BP 33000001-36000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Removing 99700 SNPs from long-range LD region on chromosome 6 BP 25500000-33500000
[WARNING]  Only 12641/23058 SNPs in chromosome 6 BP 24000001-27000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Only 18850/24200 SNPs in chromosome 6 BP 33000001-36000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Removing 39057 SNPs from long-range LD region on chromosome 8 BP 8000000-12000000
[WARNING]  Removing 39057 SNPs from long-range LD region on chromosome 8 BP 8000000-12000000
[WARNING]  Removing 39057 SNPs from long-range LD region on chromosome 8 BP 8000000-12000000
[WARNING]  Only 17020/26760 SNPs in chromosome 8 BP 6000001-9000001 have annotations info. This may severely down-bias the LD-scores
[WARNING]  Removing 39057 SNPs from long-range LD region on chromosome 8 BP 8000000-12000000
[WARNING]  Only 17020/26760 SNPs in chromosome 8 BP 6000001-9000001 have annotations info. This may severely down-bias the LD-scores

The result was the ldscore files for chromosomes 6, 8, and 11 had fewer SNPs than the annotation. I subsequently fixed this by reprocessing my 3 ldscore files to include all of the SNPs from your baseline ldscore files for chromosomes 6, 8, and 11. This results in the ldscore files from your baseline annotation and my new annotation to have the same number of SNPs. However, I am still getting an error. From my debugging it seems that I am getting an error when the ldscore_fromlist function in the 'parse.py' file is being called. The error seems to occur at the step where it is trying to append the second annotation to the baseline annotation. The error seems to occur at this line

if (not series_eq(y.SNP, ldscore_array[0].SNP)) or (not series_eq(y.index, ldscore_array[0].index)):

The error I get is the following:

Traceback (most recent call last):
  File "/gpfs/commons/home/clakhani/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/gpfs/commons/home/clakhani/.conda/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/common.py", line 65, in new_method
    return method(self, other)
  File "/gpfs/commons/home/clakhani/.conda/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 365, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects

which is due to series_eq(y.SNP, ldscore_array[0].SNP)

It is unclear why this might be the case. When I look at y.SNP it looks like this

y.SNP
snpid
1.10177.A.AC                                             rs367896724
1.10352.T.TA                                             rs201106462
1.10511.A.G                                              rs534229142
1.10616.CCGCCGTTGCAAAGGCGCGCCG.C    1:10616_CCGCCGTTGCAAAGGCGCGCCG_C
1.11008.C.G                                              rs575272151
                                                  ...               
22.51239794.A.C                                          rs561893765
22.51240084.C.G                                          rs529322970
22.51240820.C.T                                          rs202228854
22.51241386.C.G                                          rs568168135
22.51244237.C.T                                          rs575160859
Name: SNP, Length: 19386297, dtype: object

and when I look at ldscore_array[0].SNP it looks like this

snpid
1.10177.A.AC                                             rs367896724
1.10352.T.TA                                             rs201106462
1.10511.A.G                                              rs534229142
1.10616.CCGCCGTTGCAAAGGCGCGCCG.C    1:10616_CCGCCGTTGCAAAGGCGCGCCG_C
1.11008.C.G                                              rs575272151
                                                  ...               
22.51239794.A.C                                          rs561893765
22.51240084.C.G                                          rs529322970
22.51240820.C.T                                          rs202228854
22.51241386.C.G                                          rs568168135
22.51244237.C.T                                          rs575160859
Name: SNP, Length: 19386297, dtype: object
y.SNP.head()
snpid
1.10177.A.AC                                             rs367896724
1.10352.T.TA                                             rs201106462
1.10511.A.G                                              rs534229142
1.10616.CCGCCGTTGCAAAGGCGCGCCG.C    1:10616_CCGCCGTTGCAAAGGCGCGCCG_C
1.11008.C.G                                              rs575272151
Name: SNP, dtype: object

I am curious if you know why this might be the case? If it helps I can provide you the annotation and the summary statistics I used to run my analysis.

omerwe commented 3 years ago

Hi Chirag,

Thanks for the bug report. Can you please prepare a small (minimal) use example that demonstrates the problem? If you can do this and send to oweissbrod@hsph.harvard.edu, this will be a great help.

Best,

Omer

cmlakhan commented 3 years ago

Sure, will send you an email today.

omerwe commented 3 years ago

Hi Chirag,

My apologies for the delay in getting back to you.

I looked into the errors, and found that there were differences in the order of the alleles of multi-allelic SNPs between the two files. For example, for SNP rs62639980 in chr1, the first pair of alleles in the BaselineLD files is A:G, and the second is C:G. However, the order is reversed in your files. I know that this is a silly mistake, but unfortunately the code requires the exact same order of SNPs, including the identities of their alleles.

Incidentally, the code had another problem where the error message wasn't communicated clearly. I fixed this bug, so hopefully the error message will be clearer in the future.

I hope this answers your question, please let me know if not!

Best,

Omer