The parseGenBank function uses a regex to detect genbank field names.
The pattern matches lines starting with upper-case words followed by spaces: "^[[:upper:]]+[[:space:]]+"
However, if line is simply ORIGIN with no trailing spaces, it will fail to detect it.
This, in turn, cascades onto many other problems.
Here is my solution, in case anyone finds it useful.
Also note that dss = Biostrings::extractAt(origin, IRanges::ranges(srcs)) failed because some of my sequences have no "source" feature, because they are Benchling exports. I submitted #12 for this bug.
parseGenBank2 <- function (file, text = readLines(file), partial = NA, verbose = FALSE,
ret.anno = TRUE, ret.seq = TRUE)
{
if (!ret.anno && !ret.seq)
stop("Must return at least one of annotations or sequence.")
bf = proc.time()["elapsed"]
if (missing(text) && !file.exists(file))
stop("No text provided and file does not exist or was not specified. Either an existing file or text to parse must be provided.")
if (length(text) == 1)
text = genbankr:::fastwriteread(text)
prime_field_re <- "^[[:upper:]]+[[:space:]]*"
fldlines = grepl(prime_field_re, text)
fldfac = cumsum(fldlines)
fldnames = gsub("^([[:upper:]]+).*", "\\1", text[fldlines])[fldfac]
spl = split(text, fldnames)
resthang = list(LOCUS = genbankr:::readLocus(spl[["LOCUS"]]))
resthang[["FEATURES"]] = genbankr:::readFeatures(spl[["FEATURES"]],
source.only = !ret.anno, partial = partial)
seqtype = genbankr:::.seqTypeFromLocus(resthang$LOCUS)
resthang$ORIGIN = if (ret.seq)
genbankr:::readOrigin(spl[["ORIGIN"]], seqtype = seqtype)
else NULL
if (ret.anno) {
resthang2 = mapply(function(field, lines, verbose) {
switch(field, DEFINITION = genbankr:::readDefinition(lines),
ACCESSION = genbankr:::readAccession(lines), VERSION = genbankr:::readVersions(lines),
KEYWORDS = genbankr:::readKeywords(lines), SOURCE = genbankr:::readSource(lines),
NULL)
}, lines = spl, field = names(spl), SIMPLIFY = FALSE,
verbose = verbose)
resthang2$FEATURES = resthang2$FEATURES[sapply(resthang2$FEATURES,
function(x) length(x) > 0)]
resthang2 = resthang2[!names(resthang2) %in% names(resthang)]
resthang = c(resthang, resthang2)
}
origin = resthang$ORIGIN
if (ret.seq && length(origin) > 0) {
typs = sapply(resthang$FEATURES, function(x) x$type[1])
source_types <- resthang$FEATURES[typs == "source"]
if(length(source_types) > 0){
srcs = genbankr:::fill_stack_df(source_types)
dss = Biostrings::extractAt(origin, IRanges::ranges(srcs))
names(dss) = as.character(GenomeInfoDb::seqnames(srcs))
if (ret.anno)
resthang$ORIGIN = dss
else
resthang = dss
} else {
warning("No 'source' feature types in entry.")
}
} else if (!ret.anno) {
stop("Asked for only sequence (ret.anno=FALSE) from a file with no sequence information")
}
af = proc.time()["elapsed"]
if (verbose)
message("Done Parsing raw GenBank file text. [ ", af -
bf, " seconds ]")
resthang
}
The
parseGenBank
function uses a regex to detect genbank field names.The pattern matches lines starting with upper-case words followed by spaces:
"^[[:upper:]]+[[:space:]]+"
However, if line is simply
ORIGIN
with no trailing spaces, it will fail to detect it.This, in turn, cascades onto many other problems.
Here is my solution, in case anyone finds it useful.
Also note that
dss = Biostrings::extractAt(origin, IRanges::ranges(srcs))
failed because some of my sequences have no "source" feature, because they are Benchling exports. I submitted #12 for this bug.