Suggestion to avoid ifelse() in string

This one is not as "free" as #37/#38 as there might be a readability/simplicity trade-off & the improvement is not so dramatic (only ~6%).

An alternative is to append after the branch:

pd$tag[pd$terminal] <- paste0("</", pd$token[pd$terminal], ">")

I eschewed that since it might introduce a memory efficiency tradeoff as creating then overwriting a bunch of strings in pd$tag smells like a gc() risk for the global string cache to me. Though I'm not totally sure this approach avoids the issue, at least there will be a lot of repeats in terminal_tag so the cache will have a lot of hits.

But submitting the PR anyway in case it's appealing. Here's the relevant benchmark on a particularly large file:

microbenchmark(times = 100, old = {
  if (!anyNA(pd$line1)) {
    pd$tag <- paste0(
      "<", pd$token,
      " line1=\"", pd$line1,
      "\" col1=\"", pd$col1,
      "\" line2=\"", pd$line2,
      "\" col2=\"", pd$col2,
      "\" start=\"", pd$start,
      "\" end=\"", pd$end,
      "\">",
      if (!is.null(pd$text)) xml_encode(pd$text) else "",
      ifelse(pd$terminal, paste0("</", pd$token, ">"), "")
    )
  }
}, new = {
  terminal_tag <- character(nrow(pd))
  terminal_tag[pd$terminal] <- paste0("</", pd$token[pd$terminal], ">")
  if (!anyNA(pd$line1)) {
    paste0(
      "<", pd$token,
      " line1=\"", pd$line1,
      "\" col1=\"", pd$col1,
      "\" line2=\"", pd$line2,
      "\" col2=\"", pd$col2,
      "\" start=\"", pd$start,
      "\" end=\"", pd$end,
      "\">",
      if (!is.null(pd$text)) xml_encode(pd$text) else "",
      terminal_tag
    )
  }
})
# Unit: milliseconds
#  expr      min       lq      mean   median        uq      max neval cld
#   old 945.0101 987.2273 1035.4537 1011.562 1044.0095 1444.550   100   b
#   new 890.7811 927.3121  962.1157  949.027  975.0718 1288.354   100  a

r-lib / xmlparsedata

Suggestion to avoid ifelse() in string #39