Open bschilder opened 2 years ago
The following tests were conducted on the Neurogenomics private cloud, within a virtual machine instance with 252GB of RAM, (8?)TB of storage, and 64 CPU cores (128 threads).
MungeSumstats::import_sumstats
Documented here: https://github.com/neurogenomics/MungeSumstats/issues/113
BiocParallel errors
0 remote errors, element index:
50 unevaluated and other errors
first remote error:
These errors arise in the read_vcf_parallel step.
subcategories3 <- c("neurological","Immune","cardio")
metagwas3 <- MungeSumstats::find_sumstats(subcategories = subcategories3)
# filter_traits() is not a MungeSumstats function; it is assumed to come from
# the utils.R script sourced in the map_snps_to_genes section below.
meta <- filter_traits(meta = metagwas3,
                      group_var = "subcategory",
                      topn = 100)
gwas_paths <- MungeSumstats::import_sumstats(
  ids = meta$id,
  save_dir = here::here("data/GWAS_sumstats"),
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE,
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"),
  ### axel will keep trying forever if the URL doesn't exist (or is private)
  # download_method = "axel",
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE
)
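Rather than hard-coding the thread count, the ~30-thread ceiling noted in the comment above could be enforced programmatically. The following is only a minimal sketch: `cap_threads` is a hypothetical helper (not part of MungeSumstats), and the default cap of 30 simply reflects what was observed on this particular VM.

```r
# Hypothetical helper: choose a thread count that never exceeds an
# empirically observed ceiling, the cores available, or the number of tasks.
cap_threads <- function(requested, cap = 30, n_tasks = Inf) {
  available <- parallel::detectCores()
  # detectCores() can return NA on some platforms; fall back to 1.
  if (is.na(available)) available <- 1L
  as.integer(max(1, min(requested, cap, available, n_tasks)))
}

# Example: request all 64 cores, but end up with at most 30.
n_thread <- cap_threads(requested = 64)
print(n_thread)
```

The result could then be passed as `nThread = n_thread` in the `import_sumstats` call.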
MAGMA.Celltyping::map_snps_to_genes
Parallelising map_snps_to_genes across too many threads actually seems to slow everything down, despite having sufficient memory.
This might be because of genes_only=TRUE? But this should make things faster, not slower.
genes_only=TRUE seems to speed things up (though unsure how significantly).
Limit the number of parallel threads to <=20. This results in low memory usage (~11/252GB at any given time, indicated by the green bar in htop) but, perhaps importantly, ensures that the amount of memory being reserved only reaches ~2/3 of the max (the yellow bar in htop).
source("https://github.com/neurogenomics/MAGMA_Files_Public/raw/master/code/utils.R")
save_dir <- here::here("data/GWAS_sumstats")
# Collect metadata for the munged sumstats, supplying sample sizes (N)
# for the studies listed in N_dict.
meta <- gather_metadata(save_dir = save_dir,
                        N_dict = c("Wightman2021" = 1126563,
                                   "Vuckovic2020" = 408112))
data.table::setkey(meta, id)
t1 <- Sys.time()
# Map SNPs to genes for each GWAS in parallel, capped at 20 workers.
magma_files <- parallel::mclapply(seq_len(nrow(meta)),
                                  function(i){
  EWCE:::message_parallel("----- ", i, " : ",
                          meta$id[i], " -----")
  tryCatch(expr = {
    MAGMA.Celltyping::map_snps_to_genes(
      # version = "1.08",
      path_formatted = meta$munged_path[i],
      genome_build = meta$build_final[i],
      N = if(is.na(meta$N[i])) NULL else meta$N[i],
      population = meta$population_1KG[i],
      upstream_kb = 35,
      downstream_kb = 10,
      genes_only = TRUE,
      force_new = FALSE
    )
    # On failure, report the error message and return NULL for that GWAS.
  }, error = function(e) {EWCE:::message_parallel(conditionMessage(e)); NULL})
}, mc.cores = min(nrow(meta), 20) ) |> `names<-`(meta$id)
t2 <- Sys.time()
print(t2-t1)
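To check whether more threads really do slow things down, the same workload could be timed at several values of `mc.cores`. The sketch below is an illustration only: `benchmark_cores` and `toy_task` are hypothetical names, and the toy task (a short sleep) stands in for a real `map_snps_to_genes` call, so its timings say nothing about MAGMA itself.

```r
# Hypothetical helper: time one workload at several worker counts.
benchmark_cores <- function(task, n_tasks, cores = c(1, 2, 4)) {
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L
  # mclapply relies on forking, which is unavailable on Windows.
  if (.Platform$OS.type == "windows") available <- 1L
  cores <- unique(cores[cores <= available])
  elapsed <- vapply(cores, function(k) {
    system.time(
      parallel::mclapply(seq_len(n_tasks), task, mc.cores = k)
    )[["elapsed"]]
  }, numeric(1))
  data.frame(cores = cores, elapsed_sec = elapsed)
}

# Stand-in task (assumption): replace with the per-GWAS function of interest.
toy_task <- function(i) {
  Sys.sleep(0.05)
  i
}

res <- benchmark_cores(toy_task, n_tasks = 8)
print(res)
```

If elapsed time stops improving (or gets worse) beyond some core count, that count is a reasonable value for the `mc.cores` cap used above.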
A parallelised version of MAGMA was recently described here: https://star-protocols.cell.com/protocols/1392