Open fdchevalier opened 5 years ago
Hi @fdchevalier, thank you for the post and thank you for taking the time to explain the situation and propose a solution! I am currently on holiday, so I won't be able to make much progress on this now. But I'll leave this open so I can revisit it in the new year. Happy holidays!
Hi Brian,
No worries, this can wait next year. I have already made the modifications on the function I use anyway. Enjoy your vacations.
Happy holidays!
Hi Brian,
I found an unexpected behavior of
vcfR2DNAbin()
. I would not qualify this as a bug per se because it is coming from a particular behavior of a DNAbin object created by one function from theape
package (but I don't know if this is a bug in this function as I don't know the specifications of the DNAbin class). Anyway I found a solution forvcfR
to handle all cases.The unexpected behavior
I wanted to include my variants in a reference sequence using
vcfR2DNAbin()
. This reference sequence was extracted from a reference genome imported usingread.dna()
from theape
package. The result produced byvcfR2DNAbin()
contained a sequence of the right length but withNA
for all the invariant bases instead of the expected reference bases. When using the example of thevcfR
manual, everything worked perfectly fine.The explanation
The behavior occurs only when the sequence is stored as a list in the DNAbin object. The list is used when sequences of different lengths are loaded in the same DNAbin object which is the case of my reference genome. A sequence extracted from the reference genome will inherit the list type. But the behavior also depends on what function is used for creating the list: if
as.DNAbin()
is used, everything works fine withvcfR2DNAbin()
but ifread.dna()
is used, the behavior occurs.Here is a reproducible example inspired from
vcfR
manual:The problem comes from the way the data is accessible from each object. This has a big impact on
vcfR2DNAbin()
. The data loaded withas.DNAbin()
is never directly accessible (myRef.db[[1]]
) contrary to the data loaded withread.dna()
(myRef.rd[[1]]
). So whenas.character()
is used (line 262 ofvcfR2DNAbin()
), the binary code is translated in nucleotides in the first case but it is kept as is in the second case. WhenvcfR2DNAbin()
recreates a DNAbin object, in the first case, it translates back the nucleotides into binary code but in the second case, as the binary code is already present, it translates the binary numbers intoNA
because they do not correspond to any nucleotides.Here is an illustration:
The remedy
As I wrote, it might be a bug from
read.dna()
but I found a work around to accommodate any cases when a DNAbin object is of list type. The first step is is to convert the first item of the list to a sequence without trying to access the item itself (so the sequence will still be of list type), unlist the sequence and store the vector for later use. This implies also to modify theif
statement of theis.matrix(ref.seq)
to obtain the sequence and to modify the way the matrix of reference sequences is created.Here are the modifications proposed:
Fred