gmbecker / genbankr

http://bioconductor.org/packages/devel/bioc/html/genbankr.html

Reading multi-record genbank files #3

Closed jbisanz closed 4 years ago

jbisanz commented 7 years ago

Hi, thanks for this package.

I am trying to import some bacterial draft genomes into R using readGenBank, but I get the following error: Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW), : key 79 (char 'O') not in lookup table

After tracing through this function, the problem appears to lie in readOrigin: the string "ORIGIN" ends up in the variable chars, which should contain only the sequence. Am I right to infer that readGenBank cannot read a multi-record gbk file? If so, is this functionality planned for the future?
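
One quick way to check whether a file is multi-record is to count its LOCUS and ORIGIN lines; more than one of each would be consistent with a second record's "ORIGIN" line being read as sequence. The following is a minimal base R sketch, not part of genbankr: the helper name count_genbank_records and the toy two-record input are made up for illustration. In practice the lines would come from readLines() on the real file.

```r
# Hypothetical helper (not part of genbankr): count the records in
# GenBank-format text by counting lines that start with "LOCUS".
count_genbank_records <- function(lines) {
  sum(grepl("^LOCUS\\s", lines))
}

# Toy two-record input, invented for illustration only
example_lines <- c(
  "LOCUS       contig_1   8 bp    DNA     linear   BCT 01-JAN-2000",
  "ORIGIN",
  "        1 acgtacgt",
  "//",
  "LOCUS       contig_2   4 bp    DNA     linear   BCT 01-JAN-2000",
  "ORIGIN",
  "        1 ttga",
  "//"
)

n_records <- count_genbank_records(example_lines)
n_origin  <- sum(grepl("^ORIGIN", example_lines))

# More than one record means the file needs multi-record handling
is_multi_record <- n_records > 1
```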

mikemc commented 4 years ago

This may be useful to others with the same problem. A workaround I have found is to split the file text within R into one string per record and then call readGenBank on each record separately. Using tidyverse functions:

library(tidyverse)

fn <- "path/to/project.gbff.gz"
txt <- read_file(fn)
# Split into strings for individual records
txt.split <- txt %>%
  str_split("\n//\n") %>%
  unlist
# The last element should be an empty string ("") and will cause an error
txt.split <- txt.split[txt.split != ""]
# Get a list of GenBankRecord objects for each record
recs <- txt.split %>%
  map(~genbankr::readGenBank(NULL, text = .))
recs[[1]]
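
The same idea can be sketched in base R by working on lines instead of one long string, which also keeps each record's "//" terminator. This is only an illustration: split_genbank_records is a hypothetical helper, the toy input is invented, and the commented-out call assumes readGenBank accepts a character vector of lines through its text argument.

```r
# Hypothetical helper (not part of genbankr): split GenBank-format lines
# into one character vector per record, keeping the "//" terminator.
# Any lines after the last "//" (e.g. trailing blanks) are dropped.
split_genbank_records <- function(lines) {
  ends <- which(grepl("^//\\s*$", lines))      # terminator line indices
  if (length(ends) == 0) return(list(lines))   # no terminator: one record
  starts <- c(1, head(ends, -1) + 1)           # each record starts after the previous "//"
  Map(function(s, e) lines[s:e], starts, ends)
}

# Toy two-record input, invented for illustration only
toy_lines <- c(
  "LOCUS       contig_1   8 bp    DNA     linear   BCT 01-JAN-2000",
  "ORIGIN",
  "        1 acgtacgt",
  "//",
  "LOCUS       contig_2   4 bp    DNA     linear   BCT 01-JAN-2000",
  "ORIGIN",
  "        1 ttga",
  "//",
  ""
)

records <- split_genbank_records(toy_lines)

# With genbankr installed and real data, each record could then be parsed
# separately (assumption: `text` takes a vector of lines):
# recs <- lapply(records, function(x) genbankr::readGenBank(text = x))
```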

gmbecker commented 4 years ago

Apologies for the delay: this is finally fixed, as a first pass at least, in devel versions >= 1.15.1.
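
To see whether an installed copy is new enough, the version can be compared against 1.15.1. This is a small sketch; the helper name has_multirecord_fix is made up for illustration, and the commented-out line assumes genbankr is installed.

```r
# Version in which the fix is reported in this thread
required <- package_version("1.15.1")

# Hypothetical helper: TRUE if the given version is at least 1.15.1
has_multirecord_fix <- function(installed) {
  package_version(installed) >= required
}

# With genbankr installed, the real check would be:
# has_multirecord_fix(as.character(utils::packageVersion("genbankr")))

old_ok <- has_multirecord_fix("1.14.0")
new_ok <- has_multirecord_fix("1.15.1")
```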