When I ran MultiEWCE, only ~500/6700 phenotypes completed. The rest seem to have errored out.
The ones I inspected manually all seemed to fail with the "must have at least 4 genes" error. This is odd because I explicitly check beforehand which gene lists have enough genes (after taking the intersection with the CTD genes into account). MultiEWCE itself has an additional check for the number of valid gene lists, and even in cases where MultiEWCE determined the gene list was sufficient, EWCE would still return this error.
One of the issues can be traced back to an old version of EWCE:
I've just updated R to 4.3 in my "bioc" env and then updated all downstream packages. Will rerun the enrichment analyses and see if this helps.
Just noticed that this error randomly started appearing locally while I was running the enrichment tests on HPC.
I'm currently downloading and storing the CTD and HPO gene list separately for each job, to try to avoid any issues caused by jobs on multiple nodes reading in resources from the same file at the same time.
However, I think this introduced a second error: hitting the GitHub API download limit, since all the resources are stored on GitHub Releases via piggyback. This also explains why I didn't see this error in the earliest queries I checked manually after the conda fix.
Prioritising gene targets.
Adding term definitions.
Error in `value[[3L]]()`:
! Cannot access release data for repo "neurogenomics/HPOExplorer".
Check that you have provided a `.token` and that the repo is correctly specified.
GitHub API error (403): API rate limit exceeded for user ID 34280215. If you reach out to GitHub Support for help,
please include the request ID E118:AC81:6DA236:6EC564:65453130.
I could:
This could also have been due to some temporary GitHub outages that have been going on over the last hour or so.
Noticed another error on some runs:
Registered S3 method overwritten by 'ggtree':
method from
fortify.igraph ggnetwork
Validating gene lists..
1 / 1 gene lists are valid.
=>> PBS: job killed: mem 21688436kb exceeded limit 20971520kb
Easy fix: bump up the memory requested (currently only using 1 core and 20Gb per job).
Parallelising within R can cause issues because it copies the entire R environment for each thread. I think this is why we're so easily reaching memory limits on some jobs. Allocating 4 cores/job but turning OFF within-R parallelisation via MultiEWCE::gen_results(cores = 1) may help.
Once you've got this sussed, it's worth adding a test to catch related problems in the future.
Making all the updates seems to fix certain cases!
I just ran the following successfully, which was failing previously:
#!/usr/bin/env Rscript
library("optparse")
option_list <- list(
  optparse::make_option(c("-i", "--idx"), type="integer",
                        help="PBS_ARRAY_INDEX", metavar="character"),
  optparse::make_option(c("-n", "--ncpus"), type="integer", default=1,
                        help="Number of CPUs to use.", metavar="character"),
  optparse::make_option(c("-b", "--batches"), type="integer", default=7015,
                        help="Number of total batches.", metavar="character")
)
opt_parser <- optparse::OptionParser(option_list=option_list)
opt <- optparse::parse_args(opt_parser)
root <- "/rds/general/project/neurogenomics-lab/ephemeral/rare_disease"
library(MultiEWCE)
library(data.table)
save_dir <- file.path(root,paste0("BATCH_IDX_",opt$idx))
storage_dir <- file.path(save_dir,"resources")
ctd <- MultiEWCE::load_example_ctd(file = "ctd_DescartesHuman.rds",
save_dir = storage_dir)
#### Load and filter gene data ####
annotLevel <- 2
gene_data <- HPOExplorer::load_phenotype_to_genes(save_dir=storage_dir)
gene_data <- gene_data[gene_symbol %in% rownames(ctd[[annotLevel]]$mean_exp)]
gene_data[,n_gene:=(length(unique(gene_symbol))),by="hpo_id"]
gene_data <- gene_data[n_gene>=4,]
#### Split HPO IDs into N chunks ####
ids <- unique(gene_data$hpo_id)
chunks <- split(ids, cut(seq_along(ids),opt$batches,labels = FALSE))
#### Run enrichment analyses #####
all_results <- MultiEWCE::gen_results(
ctd = ctd,
list_name_column = "hpo_id",
list_names = chunks[[99]],
gene_data = gene_data,
annotLevel = annotLevel,
reps = 100,
cores = opt$ncpus,
force_new = TRUE,
save_dir = save_dir)
Validating gene lists..
1 / 1 gene lists are valid.
Analysing: 'HP:0008186' (1/1): 4 genes.
Computing gene counts.
Saving results ==> /rds/general/project/neurogenomics-lab/ephemeral/rare_disease/BATCH_IDX_/gen_results_2023-11-03_16-39-27.19555.rds
Just launched the PBS script; all 7015 valid phenotypes (those with enough genes) should be done in under an hour or so.
- Using the same resources for all jobs seems to help (see the sketch after this list).
- Bumped up to 4 cores and 90Gb per job, which seems to help, with not much difference in queuing time.
- Avoiding this didn't seem to make a difference in this case.
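A minimal sketch of the shared-resources setup (assumed paths; the loader calls are the same ones used in the batch script above):

```r
## Point every array job at one shared resources directory so the
## piggyback-backed downloads from GitHub Releases happen only once and
## later jobs simply reuse the cached files.
root <- "/rds/general/project/neurogenomics-lab/ephemeral/rare_disease"
storage_dir <- file.path(root, "resources")   # shared across all batch jobs
dir.create(storage_dir, showWarnings = FALSE, recursive = TRUE)
ctd <- MultiEWCE::load_example_ctd(file = "ctd_DescartesHuman.rds",
                                   save_dir = storage_dir)
gene_data <- HPOExplorer::load_phenotype_to_genes(save_dir = storage_dir)
```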
Still 203 phenotypes missing, all seemingly due to insufficient memory.
Going for a brute force approach and increasing memory to the max (before it goes to a different queue). Thanks to @Al-Murphy for sharing these med-bio queue specs.
#PBS -l walltime=36:00:00
#PBS -l select=1:ncpus=40:mem=128gb -q med-bio
#PBS -J 1-7015
module load anaconda3/personal
source activate bioc
cd $PBS_O_WORKDIR
Rscript /rds/general/project/neurogenomics-lab/live/Projects/rare_disease_ewce/pbs/rare_disease_celltyping.R -i $PBS_ARRAY_INDEX -n 1 -b 7015
Something I'm noticing about these 203 missing phenotypes is that they all seem to have fairly large gene lists. I'm not sure exactly where in MultiEWCE/EWCE this causes memory usage to increase so much, but I think it's the underlying reason for the failures:
r$> missing_dat
hpo_id hpo_name ncbi_gene_id gene_symbol disease_id n_gene batch_id
1: HP:0002650 Scoliosis 5290 PIK3CA ORPHA:276280 1047 4
2: HP:0002650 Scoliosis 5290 PIK3CA OMIM:612918 1047 4
3: HP:0002650 Scoliosis 5290 PIK3CA ORPHA:201 1047 4
4: HP:0002650 Scoliosis 5290 PIK3CA OMIM:615108 1047 4
5: HP:0002650 Scoliosis 26235 FBXL4 OMIM:615471 1047 4
---
396184: HP:0000818 Abnormality of the endocrine system 3559 IL2RA OMIM:606367 1365 6904
396185: HP:0000818 Abnormality of the endocrine system 3418 IDH2 ORPHA:163634 1365 6904
396186: HP:0000818 Abnormality of the endocrine system 3417 IDH1 ORPHA:163634 1365 6904
396187: HP:0000818 Abnormality of the endocrine system 54820 NDE1 ORPHA:2177 1365 6904
396188: HP:0000818 Abnormality of the endocrine system 57192 MCOLN1 OMIM:252650 1365 6904
r$> unique(missing_dat[,c("hpo_name","n_gene")])$n_gene|>summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
736 1073 1378 1648 1957 4887
I've further reduced the memory usage by subsetting the CTD and gene data objects (the two largest objects in each R session) to the bare minimum information.
We'll see how much this helps get results for all phenotypes once HPC finishes the array job (currently 4499/7015 phenotypes complete).
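For illustration, a rough sketch of that kind of subsetting (assumptions: the CTD behaves as a plain list indexed by annotation level, and only hpo_id/gene_symbol are needed downstream; the actual code may differ):

```r
annotLevel <- 2
ctd_small <- ctd[annotLevel]   # a list containing only the level being tested
gene_data_small <- gene_data[
  gene_symbol %in% rownames(ctd[[annotLevel]]$mean_exp),
  c("hpo_id", "gene_symbol")]  # keep only the columns needed downstream
## gen_results() would then be called with ctd = ctd_small and annotLevel = 1
```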
All subjobs are now finished, but 99 phenotypes still errored out due to insufficient memory (even after optimising memory usage and with 128Gb of memory!). Will try using a different high-memory queue for these remaining phenotypes.
I've also added a function to MultiEWCE that lets it query gprofiler via orthogene once and use cached results thereafter (sketched below).
See here for a full discussion on the choice of background genes:
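As a rough illustration of the caching idea (the helper name and cache path are hypothetical, and orthogene::all_genes is used here only as an example of a gprofiler-backed query; this is not the actual MultiEWCE implementation):

```r
## Hypothetical caching wrapper: run the gprofiler-backed query once, save it
## to disk, and reuse the cached copy in every subsequent job instead of
## hitting the API each time.
get_background_cached <- function(species = "human",
                                  cache_file = "background_genes.rds") {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))                  # reuse cached result
  }
  bg <- orthogene::all_genes(species = species)  # queries gprofiler once
  saveRDS(bg, cache_file)
  bg
}
```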
I submitted a PBS job yesterday afternoon requesting the large memory nodes. As of today...it's still queued. @Al-Murphy warned me about this, that requesting large mem nodes on HPC can take forever...
#PBS -l walltime=72:00:00
#PBS -l select=1:ncpus=64:mem=230gb
#PBS -J 1-4
module load anaconda3/personal
source activate bioc
cd $PBS_O_WORKDIR
Rscript /rds/general/project/neurogenomics-lab/live/Projects/rare_disease_ewce/pbs/rare_disease_celltyping.R -i $PBS_ARRAY_INDEX -n 1 -b 4
In the meantime, I'm going to try running these remaining phenotypes on the Threadripper, as it has 252Gb memory which should be sufficient to complete these final 99 phenotypes with large gene lists.
Side note: for the remaining 99 phenotypes, I wanted to see what levels of the HPO ontology they tended to be at. While there are many high-level terms ("phenotypic abnormality", "Abnormality of the nervous system"), the levels span from 0-16, showing that it's not just the highest-level phenotypes that we're missing.
How much RAM did the previous machines have? Had no idea EWCE could use that much memory
> How much RAM did the previous machines have? Had no idea EWCE could use that much memory
@NathanSkene by "previous machines" do you mean when Bobby ran it? I don't know how he ran it, but I can ask.
I was also surprised by this. I still haven't narrowed down where it's coming from (MultiEWCE, EWCE, my PBS script, or some HPC peculiarity).
I meant, the machines that crashed giving memory errors. Did the same jobs crash consistently?
> I meant, the machines that crashed giving memory errors. Did the same jobs crash consistently?
Yes, it was very consistently the same phenotypes with the same exact error about going over the 128Gb memory limit
ephemeral <- "/rds/general/project/neurogenomics-lab/ephemeral/rare_disease/"
ephemeral_logs <- file.path(dirname(ephemeral),"rare_disease.pbs_output")
missing_dat <- readRDS(here::here("pbs/missing_dat.rds"))
## Grab the most recent PBS job ID from the newest log file names
logs <- system(paste("ls",ephemeral_logs,"-Artlsh | tail -10"), intern = TRUE)
jobID <- gsub("[a-z]","",rev(strsplit(tail(logs,1),"\\.")[[1]])[2])
## Construct the expected error-log path for each missing batch,
## keeping only those that exist on disk
logs_df <- data.table::data.table(batch_id = unique(missing_dat$batch_id))[,
  path := file.path(ephemeral_logs,
                    paste("rare_disease_celltyping.pbs",
                          paste0("e",jobID), batch_id, sep="."))
][file.exists(path),]
## Print the logs for each batch
out <- lapply(logs_df$path, function(x){
  message("\n~~~~~~~~~~", basename(x), "~~~~~~~~~~")
  readLines(x) |> cat(sep = "\n")
})
I have these 99 phenotypes running on the Threadripper right now, parallelised across 20/64 cores. It seems to be working fine there. If for some reason running these in parallel does prove too much (still waiting for it to complete), I can make everything single-threaded to give all 252Gb of memory to each subjob.
On the Threadripper, this comes to 252Gb / 20 jobs ≈ 12.6Gb of memory per subjob. Far less than the 128Gb/subjob I was using on HPC! This leads me to suspect something weird is going on with HPC.
It's possible the memory problems on HPC were due to:
- MultiEWCE/EWCE
Spoke too soon: all the subjobs on the Threadripper crashed after hitting the shared memory limit. Usage was relatively low at first (~113Gb) but then climbed to 250Gb+ before crashing.
Rerunning now single-threaded to reduce memory load.
While that's running, I'm going to launch the Human Cell Landscape round of analyses on HPC!
OK, so I was able to run the remaining phenotypes, but only by reducing the number of iterations from 100k to 10k. While I don't recommend using an inconsistent number of iterations, this gives me a clue about what might be happening.
One of the upgrades I made in EWCE 1.7.3 was storing the bootstrapped data for every iteration. It's possible that this becomes too much data to keep in memory with 100k iterations and a very large hit gene list.
- Return gene-level scores based on adaptation of code from generate_bootstrap_plots. Now stored as a list element named gene_data in data.table format.

https://github.com/NathanSkene/EWCE/blob/master/NEWS.md#ewce-173
I'll test this by removing the stored bootstrapped data, but I will need to make some mods in EWCE to enable this feature.
Side note: the high-mem jobs on HPC are still queued... after waiting for over 4 days... 😑
> One of the upgrades I made in EWCE 1.7.3 was storing the bootstrapped data for every iteration. It's possible that this becomes too much data to keep in memory with 100k iterations and a very large hit gene list.
This was exactly the issue! I've added a new arg to EWCE::bootstrap_enrichment_test called store_gene_data= (and exposed it to MultiEWCE via ... notation). This lets users turn off computing the gene-level scores.
So why did this explode the memory? I believe it's because the data size blows up when you have a large number of reps × hits × celltypes. For example, running EWCE with 100k reps, a hit gene list of length 2000, and 77 celltypes would result in a dataframe with >15 billion rows!:
> xfun::numbers_to_words(100000*2000*77)
[1] "fifteen billion, four hundred million"
Setting store_gene_data=FALSE avoids this and thus drastically reduces memory usage. Using this arg, I'm now running 10 subjobs at once on the Threadripper and the max memory usage is stable at around 80-90Gb. That means only ~8-9Gb per subjob; a huge difference from when each one of these subjobs was crashing a machine with 128Gb!
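For example, something along these lines (a sketch based on the description above; the pass-through of extra arguments from MultiEWCE::gen_results to EWCE::bootstrap_enrichment_test via ... is assumed):

```r
all_results <- MultiEWCE::gen_results(
  ctd = ctd,
  gene_data = gene_data,
  list_name_column = "hpo_id",
  annotLevel = annotLevel,
  reps = 100000,
  cores = 1,
  save_dir = save_dir,
  store_gene_data = FALSE  # passed via `...` to EWCE::bootstrap_enrichment_test
)
```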
Great! Good to have that sorted!
Was just about to generate a report of the new results, but then our Private Cloud crashed right as I was about to do so.
I also need the Private Cloud to regenerate the Human Cell Landscape CTD after making some modifications to the levels.
@eduff is working on getting it back up and running.
Private Cloud is back up, and some additional issues with disk storage filling up completely are now resolved. This allowed me to finish up the last of the DescartesHuman analyses, and launch the HumanCellLandscape analyses!
Here's a high-level summary comparing the old results (Bobby's CTD + old HPO data) vs. the new results (using the standardised CTD + updated HPO data):
So to summarise, we have 2x the number of significant enrichment tests, but ~12% fewer significant phenotypes.
Most importantly, I think the results look a lot more consistently believable (fewer suspect associations, e.g. enteric neurons being associated with intellectual disability).
Here are the significant cell type associations with the top fold-change per phenotype:
As you can see, the cell types make a lot of sense!
HPC is undergoing some maintenance apparently (don't recall it being scheduled for today), but as soon as they're done with this my HumanCellLandscape analyses will automatically start running and should be done <24h after that.
In the meantime, I'm working on mapping the 124 cell types in HumanCellLandscape to the 77 cell types in DescartesHuman. I'm looking for a computational/systematic way to do this, but if that proves too involved I'll do the matching manually.
HumanCellLandscape results finished last week. Will report comparisons between the HumanCellLandscape vs. DescartesHuman results here:
Rerun the phenotype-level gene set enrichment analyses with EWCE using: