r-lib / ansistrings

Manipulation of ANSI colored strings
Other
9 stars 5 forks source link

Performance Improvements #8

Open brodieG opened 7 years ago

brodieG commented 7 years ago

One early concern I have based on my strwrap implemention is performance. Consider:

library(microbenchmark)
## see test-strwrap.R for y.paste/x.paste
microbenchmark(
  times=10,
  ansi_strwrap(y.paste, width=60),
  strwrap(x.paste, width = 60)
)
## Unit: microseconds
##                               expr       min        lq       mean    median
##  ansi_strwrap(y.paste, width = 60) 42335.964 44105.306 46817.3222 47474.220
##       strwrap(x.paste, width = 60)   924.296   930.628   984.7974  1003.843
##         uq      max neval
##  49831.038 50553.49    10
##   1022.784  1053.60    10

## install_github('brodieG/treeprof')
library(treeprof)
treeprof(ansi_strwrap(y.paste, width = 60))
## Ticks: 1333; Iterations: 105; Time Per: 47.07 milliseconds; Time Total: 4.942 seconds; Time Ticks: 1.333
## 
##                                                   milliseconds
## ansi_strwrap ---------------------------------- : 47.07 - 0.00
##     ansi_substr ------------------------------- : 22.88 - 0.04
##     |   vapply -------------------------------- : 22.85 - 0.07
##     |       FUN ------------------------------- : 22.77 - 0.00
##     |           ansi_substr1 ------------------ : 22.77 - 0.28
##     |               make_ansi_map1 ------------ : 19.56 - 0.35
##     |               |   re_exec_all ----------- : 11.40 - 0.56
##     |               |   $ --------------------- :  5.12 - 0.46
##     |               |   make_shifts1 ---------- :  2.40 - 0.21
##     |               |   .Call ----------------- :  0.28 - 0.28
##     |               paste0 -------------------- :  1.20 - 0.18
##     |               |   paste ----------------- :  0.85 - 0.21
##     |               $ ------------------------- :  1.06 - 0.14
##     |               |   $.data.frame ---------- :  0.92 - 0.07
##     |               vapply -------------------- :  0.60 - 0.28
##     lapply ------------------------------------ : 11.33 - 0.00
##     |   ansi_strsplit ------------------------- : 11.19 - 0.00
##     |       lapply ---------------------------- : 10.56 - 0.00
##     |       |   FUN --------------------------- : 10.56 - 0.04
##     |       |       ansi_substring ------------ : 10.31 - 0.00
##     |       |           ansi_substr ----------- : 10.31 - 0.00
##     |       re_exec_all ----------------------- :  0.39 - 0.04
##     |       strip_style ----------------------- :  0.25 - 0.00
##     |           gsub -------------------------- :  0.25 - 0.25
##     make_ansi_map ----------------------------- :  9.89 - 0.00
##     |   lapply -------------------------------- :  9.89 - 0.00
##     |       FUN ------------------------------- :  9.89 - 0.07
##     |           re_exec_all ------------------- :  5.12 - 0.07
##     |           |   [ ------------------------- :  1.13 - 0.04
##     |           |   |   [.tbl_df -------------- :  1.09 - 0.11
##     |           |   mapply -------------------- :  0.81 - 0.00
##     |           |   |   <Anonymous> ----------- :  0.81 - 0.21
##     |           |   gregexpr ------------------ :  0.81 - 0.81
##     |           |   lapply -------------------- :  0.64 - 0.04
##     |           |   |   FUN ------------------- :  0.60 - 0.04
##     |           |   stopifnot ----------------- :  0.49 - 0.14
##     |           |   |   re_ansi --------------- :  0.25 - 0.07
##     |           |   structure ----------------- :  0.42 - 0.14
##     |           |   replicate ----------------- :  0.25 - 0.00
##     |           |       sapply ---------------- :  0.25 - 0.04
##     |           $ ----------------------------- :  3.21 - 0.25
##     |           |   $.tbl_df ------------------ :  1.69 - 0.18
##     |           |   |   has_name -------------- :  1.45 - 0.18
##     |           |   $.rematch_allrecords ------ :  1.27 - 0.04
##     |           |       lapply ---------------- :  1.02 - 0.32
##     |           make_shifts1 ------------------ :  1.31 - 0.04
##     |               $ ------------------------- :  1.24 - 0.21
##     |                   $.tbl_df -------------- :  0.56 - 0.07
##     |                   $.rematch_allrecords -- :  0.46 - 0.00
##     Map --------------------------------------- :  2.58 - 0.00
##         mapply -------------------------------- :  2.58 - 0.07
##             <Anonymous> ----------------------- :  2.51 - 0.46
##                 ansi_nchar -------------------- :  0.67 - 0.00
##                 |   nchar --------------------- :  0.67 - 0.00
##                 |       strip_style ----------- :  0.67 - 0.04
##                 vapply ------------------------ :  0.53 - 0.18
##                 |   FUN ----------------------- :  0.32 - 0.18
##                 tail -------------------------- :  0.46 - 0.14
##                     tail.default -------------- :  0.32 - 0.04
##                         stopifnot ------------- :  0.28 - 0.04

Making the ansi map for each string independently looks to be expensive due to the required re_exec call. Let me know if I'm just doing something wrong.

Assuming I didn't mess things up and the stuff I think is slow is actually slow, I'm thinking that we could make the ansi maps without using re_exec at all since what we're trying to match are fixed strings so it should be okay to just do in C by reading one char at a time looking for tags (I've never tried to read UTF8 string in C so I don't know if this actually gets really complicated). This will also allow us to build the ANSI state in an as-read basis, which is the same as the terminal. Also, dealing with things like \033[1;33;41;5m becomes pretty straightforward.

brodieG commented 7 years ago

One possible relatively simple improvement is to:

This should be particularly good for the use case in strwrap where we repeatedly substring the same string.

brodieG commented 7 years ago

@gaborcsardi I took a stab at directly computing ANSI state in C and using that to implement a new version of ansi_substr (ansi_substr2). I put the results on a branch in my fork (not PRing because it's a mess, it isn't tested, only works for ASCII-128 and even then not exactly right, I haven't talked to you about how you want stuff structured, etc) and I think the early results are promising.

Performance is improved:

  x <- paste(readLines(file.path(R.home("doc"), "THANKS")), collapse = "\n")
  ## Split into paragraphs and remove the first three ones
  x <- unlist(strsplit(x, "\n[ \t\n]*\n"))[-(1:3)]

  ## Pull out the short paragraphs

  y <- tail(x, -1L)

  # We will test by adding coloring to the original test string, and confirming
  # that it ends up the same.  Let's color anything that starts with an upper
  # case letter

  uc.first.loc <- gregexpr("\\b[A-Z]\\w*\\b", y)
  uc.first <- regmatches(y, uc.first.loc)
  y.col <- y
  regmatches(y.col, uc.first.loc) <- lapply(uc.first, crayon::red)

  # and color some entire sentences

  y.col[c(2, 4, 6)] <- crayon::inverse(y.col[c(2, 4, 6)])
  y.paste <- paste(y.col, collapse = "\n\n")
  x.paste <- crayon::strip_style(y.paste)

  # writeLines(ansi_strwrap(y.paste, width = 60))

  starts <- 0:20 * 60 + 1
  stops <- 1:21 * 60
  y.no.nl <- gsub('\n', ' ', y.paste)
  y.rep <- rep(y.no.nl, length(starts))
  x.rep <- crayon::strip_style(y.rep)
  microbenchmark(times=10,
    xx <- ansi_substr(y.rep, starts, stops),
    yy <- ansi_substr2(y.rep, starts, stops),
    zz <- substr(x.rep, starts, stops)
  )
  ## Unit: microseconds
  ##                                      expr       min        lq       mean
  ##   xx <- ansi_substr(y.rep, starts, stops) 32003.939 34160.378 35565.5400
  ##  yy <- ansi_substr2(y.rep, starts, stops)   188.053   191.396   225.5647
  ##        zz <- substr(x.rep, starts, stops)    24.413    24.661    27.9602
  ##      median        uq       max neval
  ##  35413.8725 37561.690 38766.650    10
  ##    232.3210   252.102   267.456    10
  ##     26.3995    28.975    41.744    10

and while the code isn't ready for primetime on it's face it does more or less what it is supposed to: image

Obviously not exactly right but close enough as a proof of concept. Additionally, this has the benefit of easily handling ANSI tags of the form 033[31;47m, mismatched tags, etc., by parsing the strings in a way more akin to terminals do (i.e. read; hit tag; change state; read; hit tag; change state; etc.).

Let me know if you're interested in this and I can pursue further, cleanup, etc.

Side note, are you up for making this repo public? I realize it's not even close to ready for general deployment, but I would prefer to contribute to a public repo rather than a private one.

gaborcsardi commented 7 years ago

I think this is promising! I would prefer a more generic implementation, though, either based on regular expressions or some more principled parser. That is easier to handle I think.

brodieG commented 7 years ago

Probably would go with "some more principled parser" here, mostly because the regexes are part of the problem here (slowness, tags that aren't closed, etc.). That said, I'm not entirely sure what you mean by that so it would be useful to clearly outline what the parsing code is expected to do.

Also, as a side note, I'm working on a couple of other projects right now so I probably won't be able to do anything here in the next month or two.

gaborcsardi commented 7 years ago

For a parser, I would probably define a grammar for it. But for now I think we can go with the regular expressions, hopefully this part can be sped up later without affecting the rest of the codebase.

brodieG commented 7 years ago

Do you mean a human readable representation of the grammar in the sources (e.g. as is the case with the regex tags)?

Otherwise the grammar I (in theory) implemented is the CSI code grammar as defined on wikipedia. This isn't obvious by looking at the code but it should be obvious from (as of yet non-existent) tests whether it is implemented correctly or not. Are you proposing defining a different implementation of the grammar? Or again is your concern that the grammar is not discernible from visual inspection of the code?

Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m". This goes against my earlier comment in #6, but since it actually is perfectly valid ANSI CSI escape sequence grammar with the middle tag acting both as an opening and closing tag, I'm beginning to change my mind on that. This would have implications for how we structure things. We can still use regex for now, but the semantics of the grammar do not naturally align with a "open|close" data structure.

gaborcsardi commented 7 years ago

Do you mean a human readable representation of the grammar in the sources (e.g. as is the case with the regex tags)?

Yeah, a bison/yacc grammar.

Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m".

I think we support that. We only don't support the multiple commands in the same escape sequence case, I think. But yes, we can support that as well, later.

brodieG commented 7 years ago

Note, made new issue for the parsing discussion (#10).

Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m".

I think we support that. We only don't support the multiple commands in the same escape sequence case, I think. But yes, we can support that as well, later.

Ok, so it's just the segfault issue in #6. Without looking into the details of the source it seemed that the structure returned by make_ansi_map might not support it, though it certainly could by repeating a tag. Couldn't actually check b/c of segfault.