opain / eQTL_to_TWAS

Methods for converting eQTL summary statistics into TWAS SNP-weights
GNU General Public License v3.0

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection #7

Closed: Truman11Zeng closed this issue 12 months ago

Truman11Zeng commented 1 year ago

Analysis started at 2023-03-03 02:53:03
Options are:

$sumstats [1] "/mnt/sdb/resource/eQTL/eqtlCatalogue/ge/eQTL2TWAS/Alasoo_2018_ge_macrophage_Salmonella.eQTL2TWAS.hg19.txt"

$id [1] "ENSG00000206503"

$extract [1] "/mnt/sdb/prog/CELLECT/data/ldsc/w_hm3.snplist"

$models [1] "top1" "prscs" "sbayesr" "sbayesr_robust" [5] "susie" "susie1" "dbslmm" "lassosum" [9] "ldpred2"

$plink_ref_chr [1] "/mnt/sdb/resource/PopuGenetics/1KG/LDref19/EUR/chr"

$plink_ref_keep [1] "/mnt/sdb/resources/PopuGenetics/1kGP/LDref38/EUR.sample"

$gctb [1] "/Software/gctb_203"

$gctb_ref [1] "/mnt/sdb/resource/TWASdb/eQTL_to_TWAS/GCTB_ref/ukb_50k_bigset_2.8M/ukb50k_shrunk_chr"

$gcta [1] "/Software/gcta_1.94"

$PRScs_path [1] "/Software/PRScs"

$PRScs_ref_path [1] "/Software/PRScs/ldblk_1kg_eur"

$ld_blocks [1] "/mnt/sdb/prog/focus/pyfocus/data/ld_blocks/grch37.eur.loci.bed"

$rscript [1] "Rscript"

$dbslmm [1] "/Software/DBSLMM/software"

$plink [1] "/Software/plink1.9"

$ldpred2_ref_dir [1] "/mnt/sdb/resource/TWASdb/eQTL_to_TWAS/ldpred2_ref/"

$output [1] "/mnt/sdb/zt"

$help [1] FALSE

Analysis started at 2023-03-03 02:53:03
Computing weights for ENSG00000206503 .
Afer extracting gene, sumstats contain 18884 variants.
Afer extracting variants, sumstats contain 39 variant-gene associations.
Afer extracting variants, sumstats contain 39 unique variants.
Analysis finished at 2023-03-03 02:53:28
Analysis duration was 24.52 secs
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
Calls: write.table -> file
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file '/mnt/sdb/zt/.done': No such file or directory
Execution halted

Truman11Zeng commented 1 year ago

Why does this happen? I tried changing the output folder, but that did not fix it. Could you help me find a solution? Thanks a lot!

opain commented 1 year ago

Thank you for using the script and highlighting this error.

I think this error is occurring due to all the variants being removed when applying the MAF > 1% filter. I have edited the script in the dev branch to report the number of SNPs remaining after the MAF filter, and to skip the gene if there are no variants left. Please could you try it?

If this is the cause of the error, I would suggest filtering your eQTL summary statistics to retain only variants with a MAF > 1%, or if you would like to model rarer variants, you could modify this filter in the compute_weights.R script.
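
For example, here is a minimal shell sketch (not part of the repository) for pre-filtering the summary statistics before running compute_weights.R. It assumes a whitespace-delimited file with a header column named "MAF"; adjust the column name, delimiter and threshold to match your eQTL summary statistics.

```sh
# Minimal sketch, assuming a whitespace-delimited sumstats file with a "MAF" header column.
# Keeps the header plus variants with MAF > 1%; adjust the column name to your file.
awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "MAF") col = i; print; next }
     $col > 0.01' \
    Alasoo_2018_ge_macrophage_Salmonella.eQTL2TWAS.hg19.txt \
    > Alasoo_2018_ge_macrophage_Salmonella.eQTL2TWAS.hg19.maf01.txt
```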

Let me know how you get on.

Truman11Zeng commented 1 year ago

Thanks for taking the time to answer my question! I have successfully solved the problem by following your suggestions. However, I am now facing a new, tricky problem: eQTL2TWAS has been running for a long time on the server and is occupying a lot of memory, and I don't know when it will finish. Do you have any suggestions for speeding up the construction of the TWAS models? Thanks a lot!

opain commented 1 year ago

Great!

Yes. I would recommend running the script in parallel as an array job across genes. This lets you control how many genes run simultaneously, so you avoid using an unreasonable amount of resources. You can see how I did this when generating the weights based on eQTLGen here. We use the SLURM job scheduler (i.e. sbatch commands), so you may need to modify the code if you have a different scheduler.

Steps:

  1. First, make a file listing all genes in your eQTL sumstats. Then make a file listing the genes that the script still needs to be run for. The script generates a file called .done when it completes, so if this file is not present for a given gene, I add it to the 'todo.txt' file. Code here.
  2. Then create a shell script that runs the script for each gene, selecting the gene from the 'todo.txt' file using the array job variable "$SLURM_ARRAY_TASK_ID". Code here. For more info on running array jobs, check Google.
  3. Finally, run the shell script as an array job. If you have 100 genes and want to run 20 at a time, you would set the "--array" parameter as "--array 1-100%20". In my code, I specify the number of genes by counting how many rows are in my todo.txt file. Code here. A rough sketch of these three steps is shown after this list.
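
Below is a rough, self-contained sketch of those three steps (not the exact code linked above). The file names, paths, gene ID column and compute_weights.R options are placeholders, so adapt them to your own sumstats and setup.

```sh
#!/bin/bash
# Rough sketch of the array-job workflow; paths, file names and the gene ID column
# are hypothetical placeholders, not the exact code from the linked repository.

# 1. List all genes in the eQTL sumstats (here assuming gene IDs are in column 2),
#    then list the genes that do not yet have a .done file in their output folder.
awk 'NR > 1 {print $2}' sumstats.txt | sort -u > all_genes.txt
> todo.txt
while read -r gene; do
  [ -f "output/${gene}/.done" ] || echo "${gene}" >> todo.txt
done < all_genes.txt

# 2. Write a per-gene script (run_gene.sh) that picks one gene per array task.
#    Add the remaining compute_weights.R options (models, reference paths, etc.) as needed.
cat > run_gene.sh << 'EOF'
#!/bin/bash
gene=$(sed -n "${SLURM_ARRAY_TASK_ID}p" todo.txt)
Rscript compute_weights.R \
  --sumstats sumstats.txt \
  --id "${gene}" \
  --output "output/${gene}"
EOF

# 3. Submit it as an array job: one task per gene in todo.txt, 20 running at a time.
n_genes=$(wc -l < todo.txt)
sbatch --array "1-${n_genes}%20" run_gene.sh
```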

If I were you, I would check that the job you currently have running is producing the .done files as expected when it completes each gene, then cancel that job and continue using the process described above. This will avoid rerunning the script for genes you already have results for.

Best wishes,

Ollie