shaunpwilkinson / aphid

Analysis with Profile Hidden Markov Models
21 stars 4 forks source link

Vectorised version of Viterbi.PHMM()? #8

Open leonjessen opened 3 years ago

leonjessen commented 3 years ago

Hi @shaunpwilkinson,

I'm wondering if there is a vectorised version of Viterbi.PHMM()?

Reproducible example below illustrates how Viterbi.PHMM() can only score one sequence at a time. This leads to substantial computation time, when scoring e.g. 500,000 sequences. Since aphid is c++ based, I was wondering if a vectorised version of Viterbi.PHMM() is available to alleviate this?

# Load libraries ----------------------------------------------------------
library("aphid")
library("ape")

# Set example data --------------------------------------------------------
seqs = c("VGAHAGEY", "VNVDEV", "VEADVAGH",
         "VKGD", "VYSTYETS", "FNANIPKH", "IAGADNGAGV")

# Train model -------------------------------------------------------------
dat_train_bin = as.AAbin(strsplit(seqs, ""))

# Initialise pHMM
phmm_init = derivePHMM.AAbin(x = dat_train_bin, k = 4)

# Train pHMM
phmm_trained = train.PHMM(x = phmm_init, y = dat_train_bin, k = 4)

# Apply model -------------------------------------------------------------

# Does not work
Viterbi(x = phmm_trained, y = dat_train_bin)

# Works
test = as.AAbin(strsplit("VGAHAGEY", ""))
Viterbi(x = phmm_trained, y = test)
shaunpwilkinson commented 3 years ago

Hi @leonjessen,

There is no option within the function to vectorise or multithread this operation, but if you have many sequences to score you can set up a cluster using the parallel package. For example if you are only interested in the log odds scores of the sequences given the model would be:

score <- function(s, model) aphid::Viterbi(model, s, odds = TRUE)$score cl <- parallel::makeCluster(2) scores <- parallel::parSapply(cl, dat_train_bin, score, model = phmm_trained) parallel::stopCluster(cl)

If you are lookinging for the full output of Viterbi.PHMM for each sequence you can remove the $score from the first line and substitute parSapply for parLapply in which case the result will be a list of lists..

Hope that helps!

Cheers, Shaun