omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

LD directly from alkesgroup server #17

Closed · bschilder closed this issue 4 years ago

bschilder commented 4 years ago

Hey Omer,

I've been looking for ways to save space on my computer when using the UKB LD files, so I extended the load_ld() function in run_finemapper.py to be able to download LD directly from the alkesgroup server.

I also added a helper function find_ld_prefix() to identify which LD file you need based on some info.

Just wanted to share these in case you found them helpful to integrate into PolyFun.

Best, Brian

import numpy as np
import pandas as pd
import logging
import os
import scipy.sparse as sparse
from io import BytesIO
import requests

def find_ld_prefix(chrom, min_pos):
    """Build the URL prefix of the UKBB LD window covering min_pos.
    Example: chrom=12, min_pos=40590585."""
    alkes_url = "https://data.broadinstitute.org/alkesgroup/UKBB_LD"
    # LD files cover overlapping 3 Mb windows whose starts are spaced 1 Mb apart
    bp_starts = list(range(1, 252000001, 1000000))
    bp_ends = [x + 3000000 for x in bp_starts]
    # pick the window with the largest start position at or below min_pos
    i = max(i for i, x in enumerate(bp_starts) if x <= min_pos)
    prefix = "chr%d_%d_%d" % (chrom, bp_starts[i], bp_ends[i])
    ld_prefix = os.path.join(alkes_url, prefix)
    return ld_prefix

def load_ld(ld_prefix, server=True):
    # e.g. ld_prefix="https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr12_41000001_44000001"
    # load SNPs info
    snps_filename_parquet = ld_prefix + '.parquet'
    snps_filename_gz = ld_prefix + '.gz'

    # note: the original `os.path.exists(...) | server==False` parsed as
    # `(exists | server) == False` due to operator precedence, so the
    # conditions are rewritten with explicit boolean logic
    if not server and os.path.exists(snps_filename_parquet):
        df_ld_snps = pd.read_parquet(snps_filename_parquet)
    elif server or os.path.exists(snps_filename_gz):
        # pandas reads the gzipped table straight from a URL or local path
        df_ld_snps = pd.read_table(snps_filename_gz, delim_whitespace=True)
    else:
        raise ValueError("couldn't find SNPs file %s or %s" % (snps_filename_parquet, snps_filename_gz))

    # load LD matrix
    R_filename = ld_prefix + '.npz'
    if server:
        print("Reading .npz file from alkesgroup server")
        r = requests.get(R_filename, stream=True)
        r.raise_for_status()
        # r.content is used instead of r.raw.read() so that any
        # transfer encoding is handled by requests
        R = sparse.load_npz(BytesIO(r.content)).toarray()
    else:
        if not os.path.exists(R_filename):
            raise IOError('%s not found' % (R_filename))
        R = sparse.load_npz(R_filename).toarray()
    # the stored matrix is triangular (with 0.5 on the diagonal), so symmetrize
    R = R + R.T
    assert np.allclose(np.diag(R), 1.0)
    logging.info('Loaded an LD matrix for %d SNPs from %s' % (R.shape[0], R_filename))

    # sanity checks
    assert R.shape[0] == R.shape[1]
    if R.shape[0] != df_ld_snps.shape[0]:
        raise ValueError('LD matrix has a different number of SNPs than the SNPs file')

    return R, df_ld_snps
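For a quick sanity check on which window the helper picks, the index arithmetic in find_ld_prefix can be re-derived standalone (no network access; the example values are the ones from the comment in the function):

```python
# Standalone re-derivation of find_ld_prefix's window selection.
chrom, min_pos = 12, 40590585
bp_starts = list(range(1, 252000001, 1000000))  # window starts, 1 Mb apart
bp_ends = [x + 3000000 for x in bp_starts]      # each window spans 3 Mb
# largest window start at or below min_pos
i = max(i for i, x in enumerate(bp_starts) if x <= min_pos)
prefix = "chr%d_%d_%d" % (chrom, bp_starts[i], bp_ends[i])
print(prefix)  # chr12_40000001_43000001
```

Because the windows overlap by 2 Mb, a SNP near 40.59 Mb also appears in the neighboring windows; this rule just picks the one where the SNP sits closest to the start.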
omerwe commented 4 years ago

Thanks Brian, great idea!!! I'll test this soon and will merge this if there's no problem

bschilder commented 4 years ago

Along these lines, I noticed that some of the programmatic LD downloads from https://data.broadinstitute.org/alkesgroup/UKBB_LD/ were failing. This seems to be because some LD files are missing (e.g. chr6_29000001_32000001).

I also noticed that some .npz files have the extension .npz2 instead of .npz. Is there a reason for this? I'm getting around it for now with a try/except that tries .npz first and then .npz2.
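For reference, a local-file version of that workaround might look like the sketch below (load_R_any_ext is a hypothetical helper name; the .npz-before-.npz2 ordering follows the comment above):

```python
import os
import scipy.sparse as sparse

def load_R_any_ext(ld_prefix):
    # Hypothetical helper: try the normal .npz extension first, then the
    # .npz2 suffix used for some files on the server.
    for ext in ('.npz', '.npz2'):
        fname = ld_prefix + ext
        if os.path.exists(fname):
            return sparse.load_npz(fname).toarray()
    raise IOError('no .npz or .npz2 file found for prefix %s' % ld_prefix)
```

For server downloads, the same loop can wrap the requests.get call and fall through to the next extension on a non-200 status code.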

omerwe commented 4 years ago

Good catch!

These all have to do with long-range LD regions. We noticed a few extremely long-range LD regions where both SuSiE and FINEMAP had many false-positive results. I excluded these regions and marked the files with the suffix .npz2 (in case we ever want them anyway). The missing files on chr6 are in the MHC region; I never computed LD for this region because its LD is known to be extremely complex and messy. Overall, I suggest not applying fine-mapping to these regions, or at least treating the results with a huge grain of salt. I'll add a notice about this to the FAQ. Thanks for alerting me to this!
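For anyone wanting to flag these loci up front before fine-mapping, here is a minimal sketch. The coordinates are taken from the hg19 window names appearing in this thread (MHC plus flanking windows, the chr8 inversion region, and the chr11 region); they are window boundaries, not precise region boundaries, and this is not an official PolyFun exclusion list:

```python
# Problematic hg19 regions mentioned in this thread (window boundaries,
# not exact region boundaries; illustrative only).
LONG_RANGE_LD = {
    6: [(26000001, 35000001)],   # MHC and flanking windows
    8: [(8000001, 12000001)],    # chr8 inversion region
    11: [(47000001, 57000001)],
}

def in_long_range_ld(chrom, pos):
    """Return True if (chrom, pos) falls inside a flagged region."""
    return any(start <= pos <= end
               for start, end in LONG_RANGE_LD.get(chrom, []))

print(in_long_range_ld(6, 29000000))   # True: inside the MHC windows
print(in_long_range_ld(12, 41000000))  # False
```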

bschilder commented 4 years ago

That totally makes sense now. Thanks for the info!

cmlakhan commented 11 months ago

I mapped these long-range LD regions to hg38, and it seems the high-LD region on chromosome 11 is just very weird: it ends up mapping to multiple chromosomes. The regions on chromosomes 6 and 8 seem to map back to their corresponding chromosomes.

chr20   28534739    28534795    chr11:47000002-57000001 1
chr11   50809938    50809993    chr11:47000002-57000001 2
chr2    94531009    94531062    chr11:47000002-57000001 3
chr11   54525074    55026708    chr11:47000002-57000001 4
chr12   34504916    34504974    chr11:47000002-57000001 5
chr11   55104000    55104074    chr11:47000002-57000001 6
chr6    25999773    30032224    chr6:26000002-30000001  1
chr6    26726947    26726981    chr6:26000002-30000001  2
chr6    31032224    35032224    chr6:31000002-35000001  1
chr8    8142479     12142492    chr8:8000002-12000001   1
omerwe commented 11 months ago

Interesting! I'm wondering if the mapping procedure itself is sensitive to long-range LD...

cmlakhan commented 11 months ago

Maybe it was specific to hg38; it seems you don't encounter this issue when you map to T2T.

chr11   47135320    57181970    chr11:47000003-57000001 1
chr11   49892316    49905511    chr11:47000003-57000001 2
chr11   54461100    88827168    chr11:47000003-57000001 3
chr11   50315348    50538592    chr11:47000003-57000001 4
chr11   55604089    55604167    chr11:47000003-57000001 5
chr6    25865714    29896466    chr6:26000003-30000001  1
chr6    26607305    26630374    chr6:26000003-30000001  2
chr6    57699849    57699906    chr6:26000003-30000001  3
chr6    29719155    29719279    chr6:26000003-30000001  4
chr6    29719031    29719155    chr6:26000003-30000001  5
chr6    29800114    29800194    chr6:26000003-30000001  6
chr6    30897231    34855024    chr6:31000003-35000001  1
chr6    30908641    30909510    chr6:31000003-35000001  2
chr6    30912894    30912981    chr6:31000003-35000001  3
chr6    31099280    31099516    chr6:31000003-35000001  4
chr6    31103775    31104088    chr6:31000003-35000001  5
chr6    31104428    31104594    chr6:31000003-35000001  6
chr6    31104899    31106763    chr6:31000003-35000001  7
chr6    31109345    31110769    chr6:31000003-35000001  8
chr6    32362682    32403756    chr6:31000003-35000001  9
chr6    32412424    32415720    chr6:31000003-35000001  10
chr6    32360455    32360761    chr6:31000003-35000001  11
chr6    32424754    32424833    chr6:31000003-35000001  12
chr6    32574684    32574743    chr6:31000003-35000001  13
chr6    32510176    32510813    chr6:31000003-35000001  14
chr6    33905698    33905727    chr6:31000003-35000001  15
chr8    7632758     11601193    chr8:8000003-12000001   1
chr1    112910837   112910886   chr8:8000003-12000001   2
chr8    9064622     9064706     chr8:8000003-12000001   3