knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

vcfR2DNAbin() - no sequence label whith DNAbin list type #127

Open fdchevalier opened 5 years ago

fdchevalier commented 5 years ago

Hi Brian,

I found another unexpected behavior when using vcfR2DNAbin() which is not related to #122. When data have indels with missing data and these indels are kept, a DNAbin object of list type is created (due to the generation of sequences of different lengths) but labels are missing. Below a reproducible example and a solution.

The reproducible example

library(ape)
data(vcfR_test)

# Create an example reference sequence.
nucs <- c('a','c','g','t')
set.seed(9)
myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1))

# Recode the POS data for a smaller example.
set.seed(99)
vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Conversion to DNAbin object
vcfR2DNAbin(vcfR_test, extract.indels=FALSE, unphased_as_NA=FALSE)
# This data has no missing data so sequences of same length can be created
# and DNAbin of matrix type is generated

# Introduction of missing data
vcfR_test@gt[5,2] <- "./.:35:4"

# Conversion to DNAbin object
vcfR2DNAbin(vcfR_test, extract.indels=FALSE, unphased_as_NA=FALSE)
# DNAbin of list type is generated
# Labels are missing

The solution

The problem comes from the transposition when generating the DNAbin object at the end of vcfR2DNAbin(). But this transposition is only needed when a matrix is converted.

Here is the modification proposed:

# Replace line 299 with this block
if (is.matrix(x)) { x <- t(x) }
x <- ape::as.DNAbin(x)

Fred