knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

vcfR2DNAbin() - problem with DNAbin list type #122

Open fdchevalier opened 5 years ago

fdchevalier commented 5 years ago

Hi Brian,

I found an unexpected behavior of vcfR2DNAbin(). I would not qualify this as a bug per se because it is coming from a particular behavior of a DNAbin object created by one function from the ape package (but I don't know if this is a bug in this function as I don't know the specifications of the DNAbin class). Anyway I found a solution for vcfR to handle all cases.

The unexpected behavior

I wanted to include my variants in a reference sequence using vcfR2DNAbin(). This reference sequence was extracted from a reference genome imported using read.dna() from the ape package. The result produced by vcfR2DNAbin() contained a sequence of the right length but with NA for all the invariant bases instead of the expected reference bases. When using the example of the vcfR manual, everything worked perfectly fine.

The explanation

The behavior occurs only when the sequence is stored as a list in the DNAbin object. The list is used when sequences of different lengths are loaded in the same DNAbin object which is the case of my reference genome. A sequence extracted from the reference genome will inherit the list type. But the behavior also depends on what function is used for creating the list: if as.DNAbin() is used, everything works fine with vcfR2DNAbin() but if read.dna() is used, the behavior occurs.

Here is a reproducible example inspired from vcfR manual:

library(ape)
data(vcfR-test)

set.seed(9)

vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Reference sequence
nucs <- c('a','c','g','t')
mySeq <- nucs[round(runif(n=20, min=0.5, max=4.5))]

myRef.mat <- matrix(mySeq, nrow=1)
rownames(myRef.mat) <- "test"

myRef.ls <- list(mySeq)
names(myRef.ls) <- "test"

## Fasta file for read.dna()
cat(">test",
    paste(mySeq, collapse=""),
    file = "exdna.fas", sep="\n")

# Loading reference sequence as matrix
myRef.db <- as.DNAbin(myRef.mat)
myRef.rd <- read.dna("exdna.fas", format="fasta")

# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# TRUE

# Loading reference sequence as list
myRef.db <- as.DNAbin(myRef.ls)
myRef.rd <- read.dna("exdna.fas", format="fasta", as.matrix=FALSE)

# Do both methods produce the same object?
identical(myRef.db, myRef.rd)
# FALSE

# Results from vcfR2DNA
myDNA.db <- vcfR2DNAbin(vcfR_test, ref.seq=myRef.db)
myDNA.rd <- vcfR2DNAbin(vcfR_test, ref.seq=myRef.rd)

as.character(myDNA.db)
as.character(myDNA.rd)

The problem comes from the way the data is accessible from each object. This has a big impact on vcfR2DNAbin(). The data loaded with as.DNAbin() is never directly accessible (myRef.db[[1]]) contrary to the data loaded with read.dna() (myRef.rd[[1]]). So when as.character() is used (line 262 of vcfR2DNAbin()), the binary code is translated in nucleotides in the first case but it is kept as is in the second case. When vcfR2DNAbin() recreates a DNAbin object, in the first case, it translates back the nucleotides into binary code but in the second case, as the binary code is already present, it translates the binary numbers into NA because they do not correspond to any nucleotides.

Here is an illustration:

as.character(myRef.db[[1]])
as.character(myRef.rd[[1]])

The remedy

As I wrote, it might be a bug from read.dna() but I found a work around to accommodate any cases when a DNAbin object is of list type. The first step is is to convert the first item of the list to a sequence without trying to access the item itself (so the sequence will still be of list type), unlist the sequence and store the vector for later use. This implies also to modify the if statement of the is.matrix(ref.seq) to obtain the sequence and to modify the way the matrix of reference sequences is created.

Here are the modifications proposed:

# Replace lines 159-162 with this block
if (is.list(ref.seq)) {
    ref.seq <- unlist(as.character(ref.seq[1]), use.names=FALSE)
}

# Replace lines 163-166 with this block
if (is.matrix(ref.seq)) {
    ref.seq <- ref.seq[1, ]
    ref.seq <- ref.seq[1:ncol(ref.seq)]
    ref.seq <- as.character(ref.seq)
}

# Replace line 262 with this line
x <- matrix(ref.seq, nrow = length(ref.seq),

Fred

knausb commented 5 years ago

Hi @fdchevalier, thank you for the post and thank you for taking the time to explain the situation and propose a solution! I am currently on holiday, so I won't be able to make much progress on this now. But I'll leave this open so I can revisit it in the new year. Happy holidays!

fdchevalier commented 5 years ago

Hi Brian,

No worries, this can wait next year. I have already made the modifications on the function I use anyway. Enjoy your vacations.

Happy holidays!