shaunpwilkinson / aphid

Analysis with Profile Hidden Markov Models

Example on running with multiple cores #4

Open leonjessen opened 5 years ago

leonjessen commented 5 years ago

Hi @shaunpwilkinson ,

I've been reading your post on aphid, but it's sparse on detail about how to train using multiple cores.

I have a rather large set of sequences (100k+) and 32 cores, which I'd like to utilise - could you give an example of how to minimise training time as much as possible? Would the below be a valid workflow and do the trick?

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
library("aphid")
library("ape")

# Load data
# ------------------------------------------------------------------------------
my_seqs = read_tsv("my_seqs.tsv") # Tibble of AA seqs of varying length

# Prepare data
# ------------------------------------------------------------------------------
set.seed(546996)
my_seqs_bin = my_seqs %>% pull(AA_seq) %>% sample %>% str_split('') %>% as.AAbin

# Derive and train pHMM
# ------------------------------------------------------------------------------

# Initialise pHMM
my_phmm = derivePHMM(x          = my_seqs_bin,
                     residues   = "AMINO",
                     seqweights = NULL)

# Train pHMM
my_phmm_trained = train(x          = my_phmm, # generic dispatches to train.PHMM
                        y          = my_seqs_bin,
                        method     = "BaumWelch", # sequences are unaligned and vary in length
                        deltaLL    = 0.01,
                        seqweights = NULL,
                        cores      = 30,
                        maxiter    = 10000)

# Save model to file
# ------------------------------------------------------------------------------
writePHMM(x = my_phmm_trained, file = "pHMM_trained.HMMERv3")
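
# The saved model can presumably be reloaded in a later session with
# readPHMM(), the reading counterpart of writePHMM() in aphid - an
# assumption on my part, untested:
my_phmm_reloaded = readPHMM("pHMM_trained.HMMERv3")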

# Score sequences
# ------------------------------------------------------------------------------

# Define scoring function for multiple sequences
# (a parallelised variant is sketched below, after the script)
get_viterbi = function(seq, pHMM){
  x = sapply(seq, function(s_i){
    s_i_bin = s_i %>% str_split('') %>% as.AAbin
    return(Viterbi(x = pHMM, y = s_i_bin)$score) # generic dispatches to the PHMM method
  })
  names(x) = seq
  return(x)
}

# Score sequences
my_seqs = my_seqs %>% mutate(viterbi = get_viterbi(AA_seq, my_phmm_trained))

# Write to file
write_tsv(my_seqs, "my_seqs_with_viterbi_scores.tsv")
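
One more thought - if the scoring step also turns out to be slow, I guess the per-sequence loop could be forked across cores with parallel::mclapply. A minimal sketch, assuming a Unix-alike system (mclapply relies on forking and doesn't support mc.cores > 1 on Windows); the wrapper function is just my own guess, not aphid API:

# (Sketch) Fork the per-sequence Viterbi scoring across 30 cores;
# mclapply is in the base 'parallel' package and requires a Unix-alike
# OS for mc.cores > 1
library("parallel")
viterbi_scores = my_seqs %>%
  pull(AA_seq) %>%
  mclapply(function(s_i){
    s_i_bin = s_i %>% str_split('') %>% as.AAbin
    Viterbi(x = my_phmm_trained, y = s_i_bin)$score
  }, mc.cores = 30) %>%
  unlist()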

Cheers, Leon

shaunpwilkinson commented 4 years ago

Hi Leon, Apologies for the sparse info, I'll see to that as soon as I have time. As far as I can see your example should run fine on 30 cores (but please let me know if it doesn't!). For smaller jobs it often takes longer to set up the cluster and distribute the processes than it does to just run the job on a single core, but in your case, training a model on over 100k sequences, it's probably best to use as many cores as you have access to. Of course this assumes that you have sufficient memory to parallelize to this extent, so to start with I would monitor the usage with vmstat or something similar.
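
For instance, something along these lines could be used to leave a little headroom rather than hard-coding the core count (detectCores() is from the base parallel package; the two-core margin is just an arbitrary choice, not an aphid recommendation):

# Leave a couple of cores free for the OS and for monitoring tools
library("parallel")
n_cores = max(1, detectCores() - 2)

You could then pass cores = n_cores to train() and derivePHMM().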

Note that you can also set cores > 1 for derivePHMM() if you find that step is taking a while.
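
For example (a sketch that just mirrors the call in your script, with the cores argument added):

my_phmm = derivePHMM(x          = my_seqs_bin,
                     residues   = "AMINO",
                     seqweights = NULL,
                     cores      = 30)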

I hope that helps, as I say please let me know if you run into issues.

Cheers, Shaun

leonjessen commented 4 years ago

Hi @shaunpwilkinson, It worked nicely - Thanks for the input! I have access to up to 1 TB of memory, so I'm all good. It took roughly 10 hours to run! And great package - Let me know if you need help with testing :+1: