Thanks Brian, great idea! I'll test this soon and merge it if there are no problems.
On Mon, Dec 9, 2019 at 4:08 PM Brian M. Schilder notifications@github.com wrote:
Hey Omer,
I've been looking for ways to save space on my computer when using the UKB LD files, so I extended the `load_ld()` function in `run_finemapper.py` to be able to download LD directly from the alkesgroup server.
I also added a helper function, `find_ld_prefix()`, to identify which LD file you need from the chromosome and base-pair position of your locus.
Just wanted to share these in case you found them helpful to integrate into PolyFun.
Best, Brian
```python
import logging
import os
from io import BytesIO

import numpy as np
import pandas as pd
import requests
import scipy.sparse as sparse


def find_ld_prefix(chrom, min_pos):
    # e.g. chrom = 12; min_pos = 40590585
    alkes_url = "https://data.broadinstitute.org/alkesgroup/UKBB_LD"
    bp_starts = list(range(1, 252000001, 1000000))
    bp_ends = [x + 3000000 for x in bp_starts]
    # index of the first 3-Mb window starting at or after min_pos
    i = [i for i, x in enumerate(bp_starts) if x >= min_pos][0]
    prefix = "chr%d_%d_%d" % (chrom, bp_starts[i], bp_ends[i])
    ld_prefix = os.path.join(alkes_url, prefix)
    return ld_prefix


def load_ld(ld_prefix, server=True):
    # e.g. ld_prefix = "https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr12_41000001_44000001"
    # load SNPs info
    snps_filename_parquet = ld_prefix + '.parquet'
    snps_filename_gz = ld_prefix + '.gz'
    if os.path.exists(snps_filename_parquet) or not server:
        df_ld_snps = pd.read_parquet(snps_filename_parquet)
    elif os.path.exists(snps_filename_gz) or server:
        # pandas reads the gzipped SNP table directly, from disk or from the URL
        df_ld_snps = pd.read_table(snps_filename_gz, delim_whitespace=True)
    else:
        raise ValueError("couldn't find SNPs file %s or %s" % (snps_filename_parquet, snps_filename_gz))

    # load the LD matrix, either from the alkesgroup server or from disk
    R_filename = ld_prefix + '.npz'
    if server:
        print("Reading .npz file from alkesgroup server")
        r = requests.get(R_filename, stream=True)
        R = sparse.load_npz(BytesIO(r.raw.read())).toarray()
    else:
        if not os.path.exists(R_filename):
            raise IOError('%s not found' % R_filename)
        R = sparse.load_npz(R_filename).toarray()
    R = R + R.T  # the file stores a triangular matrix; restore full symmetry
    assert np.allclose(np.diag(R), 1.0)
    logging.info('Loaded an LD matrix for %d SNPs from %s' % (R.shape[0], R_filename))

    # sanity checks
    assert R.shape[0] == R.shape[1]
    if R.shape[0] != df_ld_snps.shape[0]:
        raise ValueError('LD matrix has a different number of SNPs than the SNPs file')
    return R, df_ld_snps
```
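For example, with the test coordinates from the snippet above:

```python
# fetch the UKBB LD for a locus on chromosome 12
ld_prefix = find_ld_prefix(chrom=12, min_pos=40590585)
# -> https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr12_41000001_44000001
R, df_ld_snps = load_ld(ld_prefix, server=True)
print(R.shape, df_ld_snps.shape)
```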
Along these lines, I noticed that some of the programmatic LD downloads from https://data.broadinstitute.org/alkesgroup/UKBB_LD/ were failing. This seems to be because some LD files are missing (e.g. chr6_29000001_32000001).
I also noticed that some npz files have the extension `.npz2` instead of `.npz`. Is there a reason for this? I'm working around it for now with a try/except that tries `.npz` first and then falls back to `.npz2`.
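Here's a minimal sketch of that fallback (the helper name is just illustrative):

```python
from io import BytesIO

import requests
import scipy.sparse as sparse


def load_R_with_fallback(ld_prefix):
    """Try the .npz LD matrix first, then fall back to .npz2
    (illustrative helper, not part of PolyFun itself)."""
    last_err = None
    for ext in ('.npz', '.npz2'):
        url = ld_prefix + ext
        try:
            r = requests.get(url, stream=True)
            r.raise_for_status()  # 404s on missing files raise here
            return sparse.load_npz(BytesIO(r.raw.read())).toarray()
        except Exception as e:
            last_err = e
    raise IOError('no .npz or .npz2 file found for %s' % ld_prefix) from last_err
```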
Good catch!
These all have to do with long-range LD regions. We noticed a few extremely long-range LD regions where both SuSiE and FINEMAP produced many false-positive results, so I excluded these regions and marked their files with the suffix `.npz2` (in case we ever want them anyway). The missing files on chr6 are in the MHC region; I never computed the LD of this region because its LD is known to be extremely complex and messy. Overall, I suggest not applying fine-mapping to these regions, or at least treating the results with a huge grain of salt. I'll add a notice about this to the FAQ. Thanks for alerting me to this!
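A minimal pre-flight check along these lines (the region list is assembled from the regions discussed in this thread and is illustrative, not an official PolyFun exclusion list):

```python
# approximate hg19 long-range LD / MHC regions mentioned in this thread;
# illustrative only, not an official PolyFun exclusion list
EXCLUDED_REGIONS = [
    (6, 26000002, 30000001),   # MHC / chr6 long-range LD
    (6, 31000002, 35000001),   # chr6 long-range LD
    (8, 8000002, 12000001),    # chr8 long-range LD region
    (11, 47000002, 57000001),  # chr11 long-range LD region
]


def in_excluded_region(chrom, start, end):
    """True if the locus [start, end] overlaps a region where fine-mapping
    results should be treated with caution (or skipped entirely)."""
    return any(chrom == c and start <= r_end and end >= r_start
               for c, r_start, r_end in EXCLUDED_REGIONS)


# e.g. the missing MHC LD file overlaps an excluded region:
assert in_excluded_region(6, 29000001, 32000001)
```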
That totally makes sense now. Thanks for the info!
I mapped these long-range LD regions to hg38, and it seems the high-LD region on chromosome 11 is just very weird: it ends up mapping to multiple chromosomes. The regions on chromosomes 6 and 8 seem to map back to their corresponding chromosomes. The lifted coordinates are below (a sketch of how such a mapping can be reproduced follows the table).
```
# mapped_chrom  mapped_start  mapped_end  source_region_hg19       segment
chr20  28534739  28534795  chr11:47000002-57000001  1
chr11  50809938  50809993  chr11:47000002-57000001  2
chr2   94531009  94531062  chr11:47000002-57000001  3
chr11  54525074  55026708  chr11:47000002-57000001  4
chr12  34504916  34504974  chr11:47000002-57000001  5
chr11  55104000  55104074  chr11:47000002-57000001  6
chr6   25999773  30032224  chr6:26000002-30000001   1
chr6   26726947  26726981  chr6:26000002-30000001   2
chr6   31032224  35032224  chr6:31000002-35000001   1
chr8   8142479   12142492  chr8:8000002-12000001    1
```
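For anyone who wants to reproduce a mapping like this: the tables match the output format of UCSC liftOver with `-multiple`, which numbers the mapped segments of each region in a fifth BED column. A minimal sketch, assuming the `liftOver` binary and the standard UCSC hg19-to-hg38 chain file (neither is stated above):

```python
import subprocess

# hg19 long-range LD regions from this thread (BED: chrom, start, end, name);
# coordinates copied verbatim rather than converted to BED's 0-based starts
bed_lines = [
    "chr11\t47000002\t57000001\tchr11:47000002-57000001",
    "chr6\t26000002\t30000001\tchr6:26000002-30000001",
    "chr6\t31000002\t35000001\tchr6:31000002-35000001",
    "chr8\t8000002\t12000001\tchr8:8000002-12000001",
]
with open("longrange_ld_hg19.bed", "w") as f:
    f.write("\n".join(bed_lines) + "\n")

# UCSC liftOver: -multiple reports every mapped segment of each region and
# adds a segment serial number in a fifth column (as in the tables here)
subprocess.run(
    ["liftOver", "-multiple",
     "longrange_ld_hg19.bed",
     "hg19ToHg38.over.chain.gz",   # chain file from UCSC goldenPath downloads
     "longrange_ld_hg38.bed",
     "unmapped.bed"],
    check=True,
)
```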
Interesting! I'm wondering if the mapping procedure itself is sensitive to long-range LD...
Maybe it was specific to hg38; it seems you don't encounter this issue when you map to T2T.
```
# mapped_chrom  mapped_start  mapped_end  source_region            segment
chr11  47135320  57181970  chr11:47000003-57000001  1
chr11  49892316  49905511  chr11:47000003-57000001  2
chr11  54461100  88827168  chr11:47000003-57000001  3
chr11  50315348  50538592  chr11:47000003-57000001  4
chr11  55604089  55604167  chr11:47000003-57000001  5
chr6   25865714  29896466  chr6:26000003-30000001   1
chr6   26607305  26630374  chr6:26000003-30000001   2
chr6   57699849  57699906  chr6:26000003-30000001   3
chr6   29719155  29719279  chr6:26000003-30000001   4
chr6   29719031  29719155  chr6:26000003-30000001   5
chr6   29800114  29800194  chr6:26000003-30000001   6
chr6   30897231  34855024  chr6:31000003-35000001   1
chr6   30908641  30909510  chr6:31000003-35000001   2
chr6   30912894  30912981  chr6:31000003-35000001   3
chr6   31099280  31099516  chr6:31000003-35000001   4
chr6   31103775  31104088  chr6:31000003-35000001   5
chr6   31104428  31104594  chr6:31000003-35000001   6
chr6   31104899  31106763  chr6:31000003-35000001   7
chr6   31109345  31110769  chr6:31000003-35000001   8
chr6   32362682  32403756  chr6:31000003-35000001   9
chr6   32412424  32415720  chr6:31000003-35000001   10
chr6   32360455  32360761  chr6:31000003-35000001   11
chr6   32424754  32424833  chr6:31000003-35000001   12
chr6   32574684  32574743  chr6:31000003-35000001   13
chr6   32510176  32510813  chr6:31000003-35000001   14
chr6   33905698  33905727  chr6:31000003-35000001   15
chr8   7632758   11601193  chr8:8000003-12000001    1
chr1   112910837 112910886 chr8:8000003-12000001    2
chr8   9064622   9064706   chr8:8000003-12000001    3
```