ghm17 / LOGODetect

LOGODetect is a powerful tool to identify small segments that harbor local genetic correlation between two traits/diseases.
GNU General Public License v3.0

Fail to download reference data #24

Closed: zqsha closed this issue 5 months ago

zqsha commented 1 year ago

Hi expert,

Thanks for developing such an amazing toolbox. Before running the tool, I tried to download the reference data with the command below. However, the link is not available and the download fails without any error being reported. I suspect the data has been moved to another folder that users cannot access. Do you have any thoughts on this? Thanks, and looking forward to your reply.

Best, Cain

wget ftp://ftp.biostat.wisc.edu/pub/lu_group/Projects/LOGODetect/LOGODetect_data.tar.gz

[screenshot]
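In case the transfer is interrupted again, here is a small self-contained sketch of verifying a `.tar.gz` before extracting it, so a truncated FTP download is caught early. The dummy archive built here stands in for `LOGODetect_data.tar.gz` (which this sandboxed example does not actually fetch); with the real file you would first resume the download with `wget -c <URL>`.

```shell
# Build a dummy archive standing in for LOGODetect_data.tar.gz
mkdir -p refdata_demo && echo "reference data" > refdata_demo/ref.txt
tar -czf LOGODetect_demo.tar.gz refdata_demo

# gzip -t exits non-zero on a truncated or corrupt download,
# which is how a partial wget transfer usually shows up.
if gzip -t LOGODetect_demo.tar.gz; then
    tar -xzf LOGODetect_demo.tar.gz -C .
    echo "archive OK"
else
    echo "archive incomplete; re-run wget -c" >&2
fi
```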

qlu-lab commented 1 year ago

There was a temporary server hiccup several days ago. It should have been resolved by now. Please try again and let us know if the problem persists.

zqsha commented 1 year ago

OK, thanks. I can now download the reference data. However, when I run the example data the toolbox provides, it reports some errors. I first installed all of the required packages; since I regularly use LDSC for other analyses, my LDSC installation should be compatible with LOGODetect. Here is the command line that I tested:

LOGODetect.R \
    --sumstats /LOGODetect/LOGODetect_data/sumstats/BIP.txt,/LOGODetect/LOGODetect_data/sumstats/SCZ.txt \
    --n_gwas 51710,105318 \
    --ref_dir /LOGODetect/LOGODetect_data/LOGODetect_1kg_ref \
    --pop EUR \
    --ldsc_dir /LOGODetect/ldsc \
    --block_partition /LOGODetect/block_partition.txt \
    --out_dir /LOGODetect/tt \
    --chr 1

It reported the following error:

Extracting number of samples and rownames from 1000G_EUR_QC.fam...
Extracting number of variants and colnames from 1000G_EUR_QC.bim...
  File "/home/shaz/software/LOGODetect/ldsc/munge_sumstats.py", line 583
    if args.daner_n:
TabError: inconsistent use of tabs and spaces in indentation
  File "/home/shaz/software/LOGODetect/ldsc/munge_sumstats.py", line 583
    if args.daner_n:
TabError: inconsistent use of tabs and spaces in indentation
  File "/home/shaz/software/LOGODetect/ldsc/ldsc.py", line 84
    print msg
    ^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
Error in file(paste0(out_dir, "/tmp_files/ldsc/ldsc_rg.log"), "r") :
  cannot open the connection
In addition: Warning message:
In file(paste0(out_dir, "/tmp_files/ldsc/ldsc_rg.log"), "r") :
  cannot open file '/home/shaz/software/LOGODetect/tt/tmp_files/ldsc/ldsc_rg.log': No such file or directory
Execution halted

It seems something is wrong with LDSC, but I use LDSC regularly and have never hit this error before. Do you have any thoughts on fixing this issue? Thank you so much; I really appreciate it.

ghm17 commented 1 year ago

This appears to be an inconsistency between Python 2 and Python 3 (see https://github.com/bulik/ldsc/issues/90). Can you try switching to a Python 2.7 version of Anaconda and see if the error still shows up?
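The TabError and the `print msg` SyntaxError in the log are the classic symptoms of Python 3 parsing Python-2-only code. A hedged sketch of a pre-flight check is below; the `LDSC_PYTHON` variable is a made-up convenience for this example, not a LOGODetect or LDSC option, and it defaults to `python3` here only so the demo runs anywhere.

```shell
# Check which Python major version is active before launching LDSC/LOGODetect.
# LDSC's munge_sumstats.py and ldsc.py are Python 2.7 scripts; running them
# under Python 3 produces exactly the TabError / print-statement errors above.
py=${LDSC_PYTHON:-python3}    # hypothetical variable, not a real LDSC setting
ver=$("$py" -c 'import sys; print(sys.version_info[0])')
if [ "$ver" -eq 2 ]; then
    echo "OK: Python 2 is active"
else
    echo "WARNING: Python $ver is active; activate a Python 2.7 env first" >&2
fi
```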

zqsha commented 1 year ago

Hi expert, Thanks for your tips. The server managers and I have been exploring this issue, and we found something very odd that might be the cause. Specifically, we first installed Miniconda3 and set up a Python-2-based environment for LDSC.

If we only activate the LDSC environment (without loading R), the Python version is 2.7. However, after loading R/4.2.3, the Python version becomes 3.10, which might cause this issue; please see the screenshot below. Left: only the LDSC environment activated (without loading R). Right: the LDSC environment activated while R/4.2.3 is loaded. This might be interesting. Any tips are always welcome. Thanks.

[screenshot]

ghm17 commented 1 year ago

This is actually weird... Can you try manually activating Python 2 after loading R?
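The behavior zqsha describes is consistent with the R/4.2.3 module prepending its own bin directory to PATH, which then shadows the conda env's python. A self-contained demo with two stub interpreters (the directory names are invented for illustration) shows why, and why re-activating the env afterwards, as suggested above, fixes it:

```shell
# Demo: PATH order decides which "python" runs. Two stubs stand in for the
# conda env's python2 and the python3 bundled with the R module.
tmp=$(mktemp -d)
mkdir -p "$tmp/ldsc_env/bin" "$tmp/r_module/bin"
printf '#!/bin/sh\necho 2.7\n'  > "$tmp/ldsc_env/bin/python"
printf '#!/bin/sh\necho 3.10\n' > "$tmp/r_module/bin/python"
chmod +x "$tmp/ldsc_env/bin/python" "$tmp/r_module/bin/python"

export PATH="$tmp/ldsc_env/bin:$PATH"; hash -r
command -v python      # resolves to the env's python

# "module load R" effectively does this, shadowing the env:
export PATH="$tmp/r_module/bin:$PATH"; hash -r
command -v python      # now resolves to the R module's python

# Re-activating the env afterwards puts it first on PATH again:
export PATH="$tmp/ldsc_env/bin:$PATH"; hash -r
python                 # prints 2.7 again
```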

zqsha commented 1 year ago

Hi expert, Thanks for your tips. That worked: it is running again, though I am not sure whether I will hit additional bugs soon; I will keep you updated if any are reported. Just a quick question: I am using the example data provided by the toolbox to run this local genetic correlation analysis.

LOGODetect.R \
    --sumstats /LOGODetect/LOGODetect_data/sumstats/BIP.txt,/LOGODetect/LOGODetect_data/sumstats/SCZ.txt \
    --n_gwas 51710,105318 \
    --ref_dir /LOGODetect/LOGODetect_data/LOGODetect_1kg_ref \
    --pop EUR \
    --ldsc_dir /LOGODetect/ldsc \
    --block_partition /LOGODetect/block_partition.txt \
    --out_dir /LOGODetect/tt \
    --n_cores 25

I set the number of cores to 25 and the memory to 60 GB, as suggested by the online manual. I was wondering how long it would take to complete this pair of genetic correlation estimates. I guess it might take 3 or 4 days? Thanks. Any thoughts are welcome.

ghm17 commented 1 year ago

It will take around 14 hours for the example analysis. Please let me know if you have any other questions.

zqsha commented 1 year ago

Hi expert, Thank you so much for your suggestions; I really appreciate them. It has just now reported errors, as follows:

Extracting number of samples and rownames from 1000G_EUR_QC.fam...
Extracting number of variants and colnames from 1000G_EUR_QC.bim...


Interpreting column names as follows:
SNP: Variant ID (e.g., rs number)
A1: Allele 1, interpreted as ref allele for signed sumstat.
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
P: p-Value
N: Sample size
Z: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)

Reading list of SNPs for allele merge from /home/shaz/software/LOGODetect/LOGODetect_data/LOGODetect_1kg_ref/ldsc/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /home/shaz/software/LOGODetect/tt/tmp_files/dat1.txt into memory 1000000 SNPs at a time.
..... done
Read 4132142 SNPs from --sumstats file.
Removed 3148284 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs or were strand-ambiguous.
983858 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (983858 SNPs remain).
Removed 0 SNPs with N < 34473.333333333336 (983858 SNPs remain).
Median value of Z was 0.0298333012789937, which seems sensible.
Removed 0 SNPs whose alleles did not match --merge-alleles (983858 SNPs remain).
Writing summary statistics for 1217311 SNPs (983858 with nonmissing beta) to /home/shaz/software/LOGODetect/tt/tmp_files/ldsc/dat1_reformated.sumstats.gz.

Metadata:
Mean chi^2 = 1.387
Lambda GC = 1.324
Max chi^2 = 56.531
52 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Tue May 23 09:48:21 2023
Total time elapsed: 26.47s


Interpreting column names as follows:
SNP: Variant ID (e.g., rs number)
A1: Allele 1, interpreted as ref allele for signed sumstat.
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
P: p-Value
N: Sample size
Z: Z-score (0 --> no effect; above 0 --> A1 is trait/risk increasing)

Reading list of SNPs for allele merge from /home/shaz/software/LOGODetect/LOGODetect_data/LOGODetect_1kg_ref/ldsc/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from /home/shaz/software/LOGODetect/tt/tmp_files/dat2.txt into memory 1000000 SNPs at a time.
..... done
Read 4132142 SNPs from --sumstats file.
Removed 3148284 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs or were strand-ambiguous.
983858 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (983858 SNPs remain).
Removed 0 SNPs with N < 70212.0 (983858 SNPs remain).
Median value of Z was 0.0468911269309261, which seems sensible.
Removed 0 SNPs whose alleles did not match --merge-alleles (983858 SNPs remain).
Writing summary statistics for 1217311 SNPs (983858 with nonmissing beta) to /home/shaz/software/LOGODetect/tt/tmp_files/ldsc/dat2_reformated.sumstats.gz.

Metadata:
Mean chi^2 = 2.016
Lambda GC = 1.72
Max chi^2 = 154.798
2451 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Tue May 23 09:48:48 2023
Total time elapsed: 26.17s

             used  (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells     603558  32.3   14709912   785.6   11686586   624.2
Vcells   15892696 121.3 2870891604 21903.2 3588570743 27378.7

R Version: R version 4.3.0 (2023-04-21)

snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 25 CPUs.

Warning message:
In searchCommandline(parallel, cpus = cpus, type = type, socketHosts = socketHosts, :
  Unknown option on commandline: --file
Library snowfall loaded.
Library snowfall loaded in cluster.

Error in unserialize(node$con) : error reading from connection
Calls: sfLapply ... FUN -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

I think it might be related to the memory and number of cores I set. When I ran the example data this time, I requested 160 GB and 25 cores. I am not sure whether that is enough, or whether this job requires even more cores and memory. Please let me know your thoughts on this error. Thanks again.

ghm17 commented 1 year ago

Yes, this seems to be an out-of-memory issue. The memory usage depends on the number of cores specified and on the number of SNPs in the GWAS. 512 GB is sufficient for the example analysis with 25 cores, but we have not tested the lower bound of memory usage. You can also reduce the number of cores and the memory proportionally to fit the limits of your cluster's job submissions.
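Using the 512 GB / 25 cores figure above as the only data point (roughly 20 GB per worker; this is a back-of-envelope estimate from this thread, not a documented LOGODetect requirement), the proportional scaling might be sketched as:

```shell
# Back-of-envelope core count for a given memory budget, assuming memory
# scales roughly linearly with cores (512 GB for 25 cores, per the reply above).
mem_gb=160                      # memory actually available on the node
per_core_gb=$((512 / 25 + 1))   # ~21 GB per worker, rounded up to be safe
n_cores=$((mem_gb / per_core_gb))
echo "with ${mem_gb} GB, request at most ${n_cores} cores"   # -> at most 7 cores
```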

zqsha commented 1 year ago

OK, great, thanks. Very useful suggestion. I will adjust these parameters on the server.