JoniColeman opened this issue 3 years ago
Hi Joni,
I can add support for .bcor files but I need to think about what the interface should look like. Can you please explain how the data is arranged in your case? e.g. do you have a single chromosome-wide bcor file, or a bunch of bcor files that cover an entire chromosome?
The latter, which may not be ideal. I have a .bcor for each 3Mb segment in the genome, akin to what you've generated for UK Biobank (although in a different format obviously)
Hi, I added support for computing LD-scores from an arbitrary set of input .bcor files. Please see the end of the updated wiki page for details, and please let me know if it works for you!
Cheers,
Omer
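For context, the core quantity being computed is roughly the following (a minimal numpy sketch of an LD-score calculation from a per-region SNP correlation matrix, using the standard bias-corrected r^2; this is an illustration, not PolyFun's actual implementation, and the real script's windowing and filtering details may differ):

import numpy as np

def ldscores_from_corr(R, n):
    # LD score of SNP i: sum over j of the bias-corrected squared correlation,
    #   l_i = sum_j [ r_ij^2 - (1 - r_ij^2) / (n - 2) ]
    # where n is the LD reference panel sample size.
    r2 = np.asarray(R, dtype=np.float32) ** 2
    r2_adj = r2 - (1.0 - r2) / (n - 2)
    return r2_adj.sum(axis=1)

# Toy 3-SNP correlation matrix (illustrative values)
R = np.array([[1.0, 0.4, 0.1],
              [0.4, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
print(ldscores_from_corr(R, n=5000))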
This seems to work for generating LD scores - thank you!
Hi, sorry to reopen this one, but I've run into a memory issue with the script - it works fine for most of my data, but running on chromosomes 1-4 takes more than 5 days on a 90GB node. The script itself runs correctly; it's just that reading in the .bcor files takes too long.
I have a couple of solutions that I'd like to explore:
1. Conversion of .bcor to .npz files - the UK Biobank data default in PolyFun seems to run efficiently, so I'm wondering if converting the .bcor files from LDStore into .npz files might work. Would the script be adaptable to running with .npz files rather than .bcor files? (A rough conversion sketch follows below.)
2. Splitting the chromosomes at the centromere - there shouldn't be long-range LD passing across the centromere, so I think the LD score calculation could be run for half-chromosomes in this way. Would it be feasible to create LD score files per half-chromosome and then combine them post hoc?
Does either of these solutions sound feasible? If so, would it be possible to alter the script to enable them? Thanks again - sorry to keep asking you to change your software to fit my needs!
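A minimal sketch of the first idea, assuming PolyFun's bcor.py reader can be constructed from a .bcor path and exposes the readCorr() call mentioned later in this thread (the import path, class name, and the .npz layout PolyFun actually expects are assumptions; the real UKB files pair the matrix with SNP metadata):

import numpy as np
from bcor import bcor   # PolyFun's bcor.py reader; exact import/class name is an assumption

def bcor_to_npz(bcor_file, npz_file):
    # Read the full correlation matrix from the .bcor file
    # (readCorr([]) is the call finemapper.py is reported to use later in this thread)
    reader = bcor(bcor_file)
    R = reader.readCorr([]).astype(np.float32)   # 32-bit floats halve the memory footprint
    np.savez_compressed(npz_file, R=R)

bcor_to_npz('region.3Mb.bcor', 'region.3Mb.npz')   # illustrative file names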
Hi Joni,
Not a problem, but I'm surprised that the loading of .bcor files is the memory bottleneck. .bcor files are packed pretty efficiently, I don't think they have any significant disadvantage vs .npz files. Could it be that you simply have more SNPs per LD region than in our UKB data? We applied a MAF>0.1% filter, which left us with ~20K SNPs per region. If that's the case, simple solutions are (1) apply some more stringent SNP filter; or (2) use smaller regions (i.e. <3Mb long). The first option in particular should be easy to try out --- you won't have to recompute the LD, we can just apply a filter to ignore some of the SNPs that don't pass the stringent threshold. What do you think?
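A sketch of option (1), subsetting an already-computed LD matrix and its SNP table to the SNPs that pass a more stringent MAF threshold, without recomputing any LD (plain numpy/pandas; the column name and threshold are illustrative):

import numpy as np
import pandas as pd

def filter_ld_by_maf(R, snp_df, maf_col='MAF', min_maf=0.01):
    # Keep only SNPs at or above the stringent MAF threshold, and subset
    # both the correlation matrix and the SNP metadata accordingly.
    keep = snp_df[maf_col].values >= min_maf
    R_sub = R[np.ix_(keep, keep)]
    snp_sub = snp_df.loc[keep].reset_index(drop=True)
    return R_sub, snp_sub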
Ok - I was also surprised at how long the read-in took. It may be that there is something wrong with the way I am running the script, so I'll explore further and get back to you. I haven't got especially high SNP density in any region (max is about 15K SNPs), so I don't think that should be an issue. Thanks!
An update: there seems to be a general problem with reading .bcor files. I was previously using a custom-made .bcor, so I wasn't sure whether my coding had messed it up, but even when using a .bcor taken directly from LDStore, it takes >35 minutes to read into PolyFun. This is true both for the LD score script and for finemapper.py, which then also shows a subsequent error with SuSiE.
/home/coleman/Tools/polyfun/finemapper.py \
--chr 21 \
--start 19000001 \
--end 22000001 \
--out FM_Output/daner_mdd_boma.chr21.19000001.22000001.gz \
--ld mdd_boma_eur_sr-qc.hg19.ch.fl.bgn.bgen.21.19000001.22000001.bcor \
--method susie \
--sumstats Priors_Output/daner_mdd_boma.21.snpvar_ridge_constrained.gz \
--memory 90 \
--max-num-causal 5 \
--allow-missing \
--n 1648 \
--verbose
*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019-2021 Omer Weissbrod
*********************************************************************
[INFO] Loading sumstats file...
[INFO] Loaded sumstats for 90947 SNPs in 0.61 seconds
[INFO] Loading LD file mdd_boma_eur_sr-qc.hg19.ch.fl.bgn.bgen.21.19000001.22000001.bcor
[INFO] Done in 2299.05 seconds
[INFO] Flipping the effect-sign of 4206 SNPs that are flipped compared to the LD panel
[INFO] Starting functionally-informed SuSiE fine-mapping for chromosome 21 BP 19000001-22000001 (8810 SNPs)
[WARNING] R[write to console]: Error in susie_ss(XtX = XtX, Xty = Xty, yty = var_y * (n - 1), n = n, :
XtX matrix contains NA.
Traceback (most recent call last):
File "/home/coleman/Tools/polyfun/finemapper.py", line 733, in finemap
susie_obj = self.susieR.susie_suff_stat(
AttributeError: module 'susieR' has no attribute 'susie_suff_stat'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/coleman/Tools/polyfun/finemapper.py", line 1129, in <module>
verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
File "/home/coleman/Tools/polyfun/finemapper.py", line 758, in finemap
prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/robjects/functions.py", line 198, in __call__
.__call__(*args, **kwargs))
File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/robjects/functions.py", line 125, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/rinterface_lib/conversion.py", line 44, in _
cdata = function(*args, **kwargs)
File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/rinterface.py", line 623, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in susie_ss(XtX = XtX, Xty = Xty, yty = var_y * (n - 1), n = n, :
XtX matrix contains NA.
Hi Joni,
Sorry for this. I didn't write the .bcor code (it was written by the author of LDstore), so I don't know it that well. However, looking at the code I found that it always created an LD matrix of 64-bit floats (instead of 32-bit), which might be the source of the memory hog. This might slow down the code if the computer runs out of memory and has to swap some data to disk. So I just applied a very simple fix that changes it to 32-bit; can you please try that? If it doesn't help, I'll try to further speed it up by rewriting the .bcor code in C instead of Python, but let's try the simplest solution first.
Separately from this, I'm not sure about the source of the error you sent me, so I updated the code to verify that the LD matrix only contains valid numbers. If the code crashes because it doesn't, that might mean that the LD file is corrupt. Can you please try out the updated code and let me know how it goes?
Thanks,
Omer
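The two fixes described above boil down to something like the following (a sketch in plain numpy, not the actual patch to bcor.py):

import numpy as np

def prepare_ld_matrix(R):
    # Downcast from 64-bit to 32-bit floats, halving the memory footprint...
    R = np.asarray(R, dtype=np.float32)
    # ...and verify the matrix contains only valid numbers before fine-mapping;
    # NaN/Inf here suggests the .bcor file may be corrupt.
    if not np.all(np.isfinite(R)):
        raise ValueError('LD matrix contains NaN or Inf values')
    return R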
Hi Omer,
Thanks - I'll try the new code and see how that works speed-wise! Don't worry about the SuSiE error - I think that's an issue with the way I've made the bcors.
Thanks again!
Hi Omer,
Just wanted to add that the time lag is due to the loops in bcor.py. In finemapper.py the code requests the correlations (readCorr([])) for all the SNPs in the file/region, and the n x n matrix is read element by element in a loop. A simple solution would be to fetch only the upper or lower triangular matrix.
Thanks, the code already reads only the triangular half of the matrix (the .bcor file only stores half of the matrix). One cause of the long loading time is that looping in Python is slow. If I find some time I'll try rewriting the code in Cython, which could make it about 10x faster.
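For illustration, rebuilding the full symmetric matrix from a stored triangular half can be vectorized in numpy rather than looped element by element in Python (a sketch; the actual on-disk layout of .bcor values may differ):

import numpy as np

def unpack_lower_triangle(packed, m):
    # 'packed' holds the m*(m+1)/2 lower-triangle values (including the diagonal)
    # in row-major order; fill them in and mirror to the upper triangle
    # without a per-element Python loop.
    R = np.zeros((m, m), dtype=np.float32)
    rows, cols = np.tril_indices(m)
    R[rows, cols] = packed
    R = R + R.T - np.diag(np.diag(R))   # avoid doubling the diagonal
    return R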
Hi, just to comment on this one too: I ran into the same issue as @JoniColeman when running the fine-mapper script with the --geno flag (i.e. genotype plink files) rather than the UKB precomputed LD matrices. For certain chromosomes the jobs ran fast and efficiently, but for others they take 4-5 days. I'm also wondering whether this has to do with memory, or whether any file conversion could make these jobs run faster. Thank you all!
Hi, I suspect the reason for the slowness is simply that the ldstore code is slow. It could be made faster if it weren't written in Python but in a compiled language such as C. A possible workaround would be to rewrite it in Cython, but this is a big undertaking and I don't have the capacity for it right now.
My suggestion is to convert the bcor to npz (using the --cache-dir flag), which would at least allow fast loading in future runs of the same LD matrix. If anyone has the capacity to rewrite the ldstore code in Cython, that would be a big improvement. I'm happy to guide that person (I did this when I worked on the PolyFun paper, for a different version of the bcor format, but the new format is pretty different and I don't have the capacity to go back to this...)
@omerwe and @jdblischak I found that calling ldstore2 to read the .bcor file and output the LD matrix into a text file, then reading that back via np.loadtxt(), takes far less time than bcor.readCorr(). I can create a PR to implement this routine.
For 20k x 20k SNPs:
- ldstore2 --bcor-to-text takes about 3 min 20 sec
- np.loadtxt() takes 40-42 sec
- readCorr(), on the other hand, takes >25 mins
The only caveat is that we need 5-6x more temporary disk space to store the text file briefly before converting it to npz.
@jerome-f wow, that's crazy. I'm happy to accept a PR for this!
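A rough sketch of the second half of that route: loading the text matrix written by ldstore2 --bcor-to-text and caching it as a compressed .npz, then deleting the temporary text file to reclaim the extra disk space (file names and the .npz key are illustrative; the ldstore2 invocation itself is omitted):

import os
import numpy as np

def text_ld_to_npz(txt_file, npz_file, delete_text=True):
    # Read the whitespace-delimited LD matrix exported by ldstore2 --bcor-to-text
    R = np.loadtxt(txt_file, dtype=np.float32)
    np.savez_compressed(npz_file, R=R)
    if delete_text:
        os.remove(txt_file)   # the text copy is the 5-6x temporary disk overhead mentioned above

text_ld_to_npz('chr21.19000001.22000001.ld', 'chr21.19000001.22000001.npz')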
Hi,
I am interested in creating priors with PolyFun using approach 3. I would like to pass estimates of LD from the data I have, stored in .bcor files. Is this possible? The --ld-ukb flag does something similar, but seems to be set up for using UK Biobank specifically. I can see how to do the finemapping step with my bcor LD estimates, but I suspect accurate LD estimation in creating the priors will also be very important.
Thanks,
Joni