neurogenomics / MAGMA_Celltyping

Find causal cell-types underlying complex trait genetics
https://neurogenomics.github.io/MAGMA_Celltyping
70 stars 31 forks source link

Parallelise MAGMA steps #110

Open bschilder opened 2 years ago

bschilder commented 2 years ago

A parallelised version of MAGMA was recently described here: https://star-protocols.cell.com/protocols/1392

bschilder commented 2 years ago

Notes on parallelisation

The following tests were conducted on the Neurogenomics private cloud, within a virtual machine instance with 252GB of RAM, (8?)TB of storage, and 64 CPU cores (128 threads).

MungeSumstats::import_sumstats

Error description

Documented here: https://github.com/neurogenomics/MungeSumstats/issues/113

BiocParallel errors
  0 remote errors, element index: 
  50 unevaluated and other errors
  first remote error:

Current solution

Reprex

subcategories3 <- c("neurological","Immune","cardio")
metagwas3 <- MungeSumstats::find_sumstats(subcategories = subcategories3)
meta <- filter_traits(meta = metagwas3,  
                           group_var = "subcategory",
                           topn = 100) 

gwas_paths <- MungeSumstats::import_sumstats(
  ids = meta$id, 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"),
  ### axel will keep trying forever if the URL doesn't exist (or is private)
  # download_method = "axel",
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  ) 

MAGMA.Celltyping::map_snps_to_genes

Error description

Parallelising map_snps_to_genes across too many threads actually seems to slow everything down, despite having sufficient memory.

This might be because:

  1. They are all calling to the same underlying MAGMA C code.
  2. They are reserving more memory in C than the computer thinks is available. (seems likely)
  3. They are all reading from the same reference files (e.g. 1KG).
  4. MAGMA is somehow already parallelising processes internally.
  5. Docker is launching some processes that don't get killed when restarting the container.
  6. Intermediate MAGMA files are getting repeatedly appended and causing problems in subsequent reruns.
  7. genes_only=TRUE? but this should make things faster, not slower
    • Confirmed that setting genes_only=TRUE seems to speed things up (thought unsure how significantly).
  8. Newer versions of MAGMA(v1.10) has bugs that causes everything to slow down.
    • MAGMAv1.08 might be a bit faster, but hard to tell bc uses diff files each time.

Current solution

Limit the number of parallel threads to <=20. This results in low memory usage (~11/252GB at any given time, indicated by the green bar in htop) but, perhaps importantly, ensures that the amount of memory being reserved only reaches ~2/3 of the max (the yellow bar in htop).

Screenshot 2022-08-10 at 11 45 51

Reprex

source("https://github.com/neurogenomics/MAGMA_Files_Public/raw/master/code/utils.R")

save_dir <- here::here("data/GWAS_sumstats")
meta <- gather_metadata(save_dir = save_dir,
                        N_dict=c("Wightman2021"=1126563, 
                                 "Vuckovic2020"=408112))
data.table::setkey(meta,id)

t1 <- Sys.time()
magma_files <- parallel::mclapply(seq_len(nrow(meta)),
                                  function(i){ 
  EWCE:::message_parallel("----- ",i," : ",
                                      meta$id[i]," -----") 
     tryCatch(expr = { 
       MAGMA.Celltyping::map_snps_to_genes(
         # version = "1.08",
         path_formatted = meta$munged_path[i],
         genome_build = meta$build_final[i],
         N = if(is.na(meta$N[i])) NULL else meta$N[i],
         population = meta$population_1KG[i],
         upstream_kb = 35,  
         downstream_kb = 10,
         genes_only = TRUE, 
         force_new = FALSE
         )
     }, error = function(e) {EWCE:::message_parallel(e);NULL}) 
}, mc.cores = min(nrow(meta),20) ) |> `names<-`(meta$id) 
t2 <- Sys.time()
print(t2-t1)