Removal of variants from custom annotation LD scores that are in the baseline annotations

gbloeb commented 2 years ago

Hi Omer, when using compute_ldscores_from_ld.py, variants in the long-range LD regions are removed even though they are present in the baseline annotations (I presume because slightly different definitions of the long-range LD regions were used to build the baseline annotations). When these custom annotations are then used together with the baseline annotations in polyfun.py, the following error is thrown because the baseline and custom LD scores don't align:

ValueError: LD Scores for concatenation must have identical SNP columns (and A1/A2 columns if such columns exist).
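A quick way to see where two LD-score files stop lining up is to compare their key columns row by row. The sketch below is only an illustration: the helper names `first_mismatch` and `load_keys` are hypothetical, and it assumes both files are parquet files carrying `SNP`, `A1` and `A2` columns, as the error message suggests.

```python
from typing import List, Optional, Tuple

Key = Tuple[str, str, str]  # (SNP, A1, A2)


def first_mismatch(baseline: List[Key], custom: List[Key]) -> Optional[int]:
    """Return the index of the first row where the key lists differ.

    Returns None when both lists are identical (same keys, same order).
    """
    for i, (b, c) in enumerate(zip(baseline, custom)):
        if b != c:
            return i
    # All shared rows agree; a length difference means one file has extra rows.
    if len(baseline) != len(custom):
        return min(len(baseline), len(custom))
    return None


def load_keys(path: str) -> List[Key]:
    """Read the key columns of an LD-score parquet file (assumed column names)."""
    import pandas as pd  # imported lazily so the comparison itself has no dependencies

    df = pd.read_parquet(path, columns=["SNP", "A1", "A2"])
    return list(df.itertuples(index=False, name=None))
```

Calling `first_mismatch(load_keys(baseline_path), load_keys(custom_path))` on the two files for the same chromosome would then point at the first row that breaks the alignment.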

To illustrate this, the baseline annotations and LD scores contain the attached ~45,000 variants on chromosome 6 that lie between BP 25500349 and 33499937. These are explicitly removed when running ~/polyfun/compute_ldscores_from_ld.py with the following options:

python ~/polyfun/compute_ldscores_from_ld.py \
    --annot ~/group/polyfun/custom_annotations/"$ANNOT"."$CHR".annot.parquet \
    --out ~/group/polyfun/custom_annotations/"$ANNOT"."$CHR".l2.ldscore.parquet \
    --ukb \
    --ld-dir ~/group/polyfun/UKB_LD_matrices

as can be seen from the output and is documented in the log:

[WARNING] Removing 99700 SNPs from long-range LD region on chromosome 6 BP 25500000-33500000

I think that I can just remove these variants from the baseline LD files to get around this issue. Can you confirm that I do not need to change the .l2.M, weights, or annotation files?

Thank you!

chr6_longRangeLDvariants_inbaseline.txt
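The workaround proposed above (dropping the long-range LD variants from the baseline LD-score files) could be sketched roughly as follows. This is not PolyFun code: the helper names are hypothetical, variants are represented as plain `(SNP, CHR, BP)` tuples for illustration, and the region bounds are assumed to be inclusive.

```python
from typing import Iterable, List, Tuple

Variant = Tuple[str, int, int]  # (SNP, CHR, BP)


def snps_in_region(variants: Iterable[Variant], chrom: int, start: int, end: int) -> List[str]:
    """Return the IDs of variants on `chrom` with start <= BP <= end (inclusive bounds assumed)."""
    return [snp for snp, c, bp in variants if c == chrom and start <= bp <= end]


def drop_snps(variants: Iterable[Variant], to_drop: Iterable[str]) -> List[Variant]:
    """Return the variants whose ID is not in `to_drop`, preserving the original order."""
    drop = set(to_drop)
    return [v for v in variants if v[0] not in drop]


# Example with the region reported in the warning (chromosome 6, BP 25500000-33500000).
# The variant IDs below are made up for illustration.
baseline = [("rsA", 6, 25499999), ("rsB", 6, 25500349), ("rsC", 6, 33499937), ("rsD", 7, 30000000)]
removed = snps_in_region(baseline, chrom=6, start=25500000, end=33500000)
filtered = drop_snps(baseline, removed)
```

Preserving row order matters here, since the concatenation expects the SNP columns of the two files to be identical.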

omerwe commented 2 years ago

Thanks @gbloeb for the bug report!

Your solution sounds good. You don't need to change any of the other files.

I think the best solution is to modify the definitions of the long-range LD regions in polyfun_utils.py so that it's consistent with the existing annotation files: https://github.com/omerwe/polyfun/blob/9efb110b505aa5fe89c91af6b1a0fa212d24816c/polyfun_utils.py#L11

It's been a few years since I worked on this, but I guess I had slightly different definitions back then. I decided to use nice "round" numbers when I rewrote this code, but I didn't realize it would mess up the concatenation...

Unfortunately, I have almost zero bandwidth to work on this... If you can check this, can you please verify that every SNP found in the baseline annotation files is also found in the annotation files created by polyfun_utils.py? In that case, we only need to modify the region definition to go from 25500349 to 33499937. If you have the bandwidth to change the code (in polyfun_utils.py, line 11) and verify that this resolves the issue, that would be awesome.

Thanks so much,

Omer
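The check requested above could look something like the sketch below: list the baseline variants that are absent from the newly computed LD scores and report the BP span they cover. The function name and the `(SNP, BP)` input format are assumptions made for illustration. If the reported span for chromosome 6 were 25500349 to 33499937 and nothing were missing elsewhere, that would support changing only the region definition.

```python
from typing import Iterable, Optional, Tuple


def missing_from_custom(
    baseline: Iterable[Tuple[str, int]],  # (SNP, BP) pairs from the baseline file
    custom_snps: Iterable[str],           # SNP IDs from the newly computed LD scores
) -> Tuple[int, Optional[Tuple[int, int]]]:
    """Count baseline variants absent from the custom file and return their BP span.

    Returns (0, None) when every baseline variant is present in the custom file.
    """
    have = set(custom_snps)
    missing_bps = [bp for snp, bp in baseline if snp not in have]
    if not missing_bps:
        return 0, None
    return len(missing_bps), (min(missing_bps), max(missing_bps))
```

Running this per chromosome would also reveal any missing variants that fall outside the long-range LD regions, which would indicate a different cause for the mismatch.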

omerwe commented 2 years ago

@gbloeb thanks for sending me a reproducible example. I modified the code to not omit long-range LD regions from the LD-scores computation, so that the resulting files are now consistent with the published Baseline-LF files. I realize that you already recomputed the LD scores for all annotations jointly, but from now on you should be able to compute these only for your new annotations.

If you think you may need to compute new LD scores in the near future, it would be great if you could git pull the latest code and let me know if the problem is fixed. In any case, please let me know if you think we can close this issue.

omerwe / polyfun

Removal of variants from custom annotation LD scores that are in the baseline annotations #101