shaunpwilkinson / aphid

Analysis with Profile Hidden Markov Models
21 stars 4 forks source link

train.PHMM() fails with method "BaumWelch", but works with "Viterbi" #12

Open leonjessen opened 3 weeks ago

leonjessen commented 3 weeks ago

Hi @shaunpwilkinson,

Returning to your fine package here, I encounter an error, when running code, which previously ran fine: Error in FUN(X[[i]], ...) : invalid byte for class 'AAbin'

The error can seems to pertain to the BaumWelch method for training the pHMM and can be reproduced like so:

# Ensure test reproducibility
set.seed(848481)

# Set test variables
std_aa <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
            "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
n_seqs <- 100
my_seqs <- replicate(
  n = n_seqs, 
  expr = paste(sample(
    x = std_aa,
    size = sample(x = 5:25, size = 1),
    replace = TRUE), collapse = ""))

# Run Aphid workflow:
my_seqs_aa_bin <- ape::as.AAbin(strsplit(x = my_seqs, split = ""))
phmm_derived <- aphid::derivePHMM.AAbin(x = my_seqs_aa_bin)

# No error:
phmm_trained <- train.PHMM(
  x = phmm_derived,
  y = my_seqs_aa_bin,
  method = "Viterbi")

# Error:
phmm_trained <- train.PHMM(
  x = phmm_derived,
  y = my_seqs_aa_bin,
  method = "BaumWelch")

The error arises both on the CRAN and GitHub versions of the package and I am running ape_5.8, aphid_1.3.3.9000, R version 4.4.1 (2024-06-14), Platform: aarch64-apple-darwin20, Running under: macOS Monterey 12.7.5, Chip: Apple M1 Max and RStudio 2024.04.2+764 "Chocolate Cosmos" Release (e4392fc9ddc21961fd1d0efd47484b43f07a4177, 2024-06-05) for macOS

Thanks, Leon

saeculumx commented 2 weeks ago

Hi @shaunpwilkinson,

I am a student of @leonjessen and have identified the issue causing the problem in the train and utils files.

The error originates from the .disambiguateAA and .encodeAA function:

# train.R, Line 200
y <- .encodeAA(y, arity = 20, ...)

This line involves the .encodeAA function in utils.R:

# utils.R, Line 406
ambigs <- !(v %in% as.raw((65:89)[-c(2, 10, 15, 21, 24)]))

The purpose of this function is to remove any ambiguous bytes using ASCII matching. However, the current implementation incorrectly returns all !TRUE values for every vector in the AAbin object.

To resolve this issue, I have modified the operation to a function that correctly checks the validity of characters:

# Failure Operation
ambigs <- !(v %in% as.raw((65:89)[-c(2, 10, 15, 21, 24)]))

# Success Operation
check_validity <- function(char_list) {
    valid_chars <- c(LETTERS, "*", "-")
    # Check if each character is in the list of valid characters using LETTERS instead of ASCII
    validity <- sapply(char_list, function(x) x %in% valid_chars)
    return(validity)
}
...
fun <- function(v){
      #ambigs <- !(v %in% as.raw((65:89)[-c(2, 10, 15, 21, 24)]))
      ambigs <- !check_validity(v)
      ...
}

This function will produce the desired output for bytes. I have attached the test environment and the fixed package as a ZIP file with this comment for your reference.

res Fixed_aphid.zip

Thank you for your attention to this matter. Please let me know if you have any questions or need further clarification.

Thanks,
Huiyu Xie

shaunpwilkinson commented 5 days ago

Hi @leonjessen and @saeculumx - many thanks for reporting this issue and especially for providing a suggested fix! I will get this patched up as soon as I can, and let you know once the new version is up. Thanks again, Shaun