I eschewed that since it might introduce a memory efficiency tradeoff as creating then overwriting a bunch of strings in pd$tag smells like a gc() risk for the global string cache to me. Though I'm not totally sure this approach avoids the issue, at least there will be a lot of repeats in terminal_tag so the cache will have a lot of hits.
But submitting the PR anyway in case it's appealing. Here's the relevant benchmark on a particularly large file:
microbenchmark(times = 100, old = {
if (!anyNA(pd$line1)) {
pd$tag <- paste0(
"<", pd$token,
" line1=\"", pd$line1,
"\" col1=\"", pd$col1,
"\" line2=\"", pd$line2,
"\" col2=\"", pd$col2,
"\" start=\"", pd$start,
"\" end=\"", pd$end,
"\">",
if (!is.null(pd$text)) xml_encode(pd$text) else "",
ifelse(pd$terminal, paste0("</", pd$token, ">"), "")
)
}
}, new = {
terminal_tag <- character(nrow(pd))
terminal_tag[pd$terminal] <- paste0("</", pd$token[pd$terminal], ">")
if (!anyNA(pd$line1)) {
paste0(
"<", pd$token,
" line1=\"", pd$line1,
"\" col1=\"", pd$col1,
"\" line2=\"", pd$line2,
"\" col2=\"", pd$col2,
"\" start=\"", pd$start,
"\" end=\"", pd$end,
"\">",
if (!is.null(pd$text)) xml_encode(pd$text) else "",
terminal_tag
)
}
})
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# old 945.0101 987.2273 1035.4537 1011.562 1044.0095 1444.550 100 b
# new 890.7811 927.3121 962.1157 949.027 975.0718 1288.354 100 a
This one is not as "free" as #37/#38 as there might be a readability/simplicity trade-off & the improvement is not so dramatic (only ~6%).
An alternative is to append after the branch:
I eschewed that since it might introduce a memory efficiency tradeoff as creating then overwriting a bunch of strings in
pd$tag
smells like agc()
risk for the global string cache to me. Though I'm not totally sure this approach avoids the issue, at least there will be a lot of repeats interminal_tag
so the cache will have a lot of hits.But submitting the PR anyway in case it's appealing. Here's the relevant benchmark on a particularly large file: