parseGenBank bug in detecting ORIGIN field

naikymen commented 4 years ago

The parseGenBank function uses a regex to detect genbank field names.

The pattern matches lines starting with upper-case words followed by spaces: "^[[:upper:]]+[[:space:]]+"

However, if line is simply ORIGIN with no trailing spaces, it will fail to detect it.

This, in turn, cascades onto many other problems.

Here is my solution, in case anyone finds it useful.

Also note that dss = Biostrings::extractAt(origin, IRanges::ranges(srcs)) failed because some of my sequences have no "source" feature, because they are Benchling exports. I submitted #12 for this bug.

parseGenBank2 <- function (file, text = readLines(file), partial = NA, verbose = FALSE, 
  ret.anno = TRUE, ret.seq = TRUE) 
{
  if (!ret.anno && !ret.seq) 
    stop("Must return at least one of annotations or sequence.")
  bf = proc.time()["elapsed"]
  if (missing(text) && !file.exists(file)) 
    stop("No text provided and file does not exist or was not specified. Either an existing file or text to parse must be provided.")
  if (length(text) == 1) 
    text = genbankr:::fastwriteread(text)
  prime_field_re <- "^[[:upper:]]+[[:space:]]*"
  fldlines = grepl(prime_field_re, text)
  fldfac = cumsum(fldlines)
  fldnames = gsub("^([[:upper:]]+).*", "\\1", text[fldlines])[fldfac]
  spl = split(text, fldnames)
  resthang = list(LOCUS = genbankr:::readLocus(spl[["LOCUS"]]))
  resthang[["FEATURES"]] = genbankr:::readFeatures(spl[["FEATURES"]], 
    source.only = !ret.anno, partial = partial)
  seqtype = genbankr:::.seqTypeFromLocus(resthang$LOCUS)
  resthang$ORIGIN = if (ret.seq) 
    genbankr:::readOrigin(spl[["ORIGIN"]], seqtype = seqtype)
  else NULL
  if (ret.anno) {
    resthang2 = mapply(function(field, lines, verbose) {
      switch(field, DEFINITION = genbankr:::readDefinition(lines), 
        ACCESSION = genbankr:::readAccession(lines), VERSION = genbankr:::readVersions(lines), 
        KEYWORDS = genbankr:::readKeywords(lines), SOURCE = genbankr:::readSource(lines), 
        NULL)
    }, lines = spl, field = names(spl), SIMPLIFY = FALSE, 
      verbose = verbose)
    resthang2$FEATURES = resthang2$FEATURES[sapply(resthang2$FEATURES, 
      function(x) length(x) > 0)]
    resthang2 = resthang2[!names(resthang2) %in% names(resthang)]
    resthang = c(resthang, resthang2)
  }
  origin = resthang$ORIGIN
  if (ret.seq && length(origin) > 0) {
    typs = sapply(resthang$FEATURES, function(x) x$type[1])
    source_types <- resthang$FEATURES[typs == "source"]

    if(length(source_types) > 0){
      srcs = genbankr:::fill_stack_df(source_types)
      dss = Biostrings::extractAt(origin, IRanges::ranges(srcs))
      names(dss) = as.character(GenomeInfoDb::seqnames(srcs))
    if (ret.anno) 
      resthang$ORIGIN = dss
    else 
      resthang = dss
    } else {
      warning("No 'source' feature types in entry.")
    }

  } else if (!ret.anno) {
    stop("Asked for only sequence (ret.anno=FALSE) from a file with no sequence information")
  }
  af = proc.time()["elapsed"]
  if (verbose) 
    message("Done Parsing raw GenBank file text. [ ", af - 
      bf, " seconds ]")
  resthang
}

naikymen commented 4 years ago

Note that, because of this, the whole sequence was appended to the last "label" field in the features dataframe:

sbthandras commented 2 years ago

I can confirm this also happens to me

gmbecker / genbankr

parseGenBank bug in detecting ORIGIN field #11