Open brodieG opened 7 years ago
One possible relatively simple improvement is to:
ansi_substr1
to accept the mapping as givenThis should be particularly good for the use case in strwrap
where we repeatedly substring the same string.
@gaborcsardi I took a stab at directly computing ANSI state in C and using that to implement a new version of ansi_substr
(ansi_substr2
). I put the results on a branch in my fork (not PRing because it's a mess, it isn't tested, only works for ASCII-128 and even then not exactly right, I haven't talked to you about how you want stuff structured, etc) and I think the early results are promising.
Performance is improved:
x <- paste(readLines(file.path(R.home("doc"), "THANKS")), collapse = "\n")
## Split into paragraphs and remove the first three ones
x <- unlist(strsplit(x, "\n[ \t\n]*\n"))[-(1:3)]
## Pull out the short paragraphs
y <- tail(x, -1L)
# We will test by adding coloring to the original test string, and confirming
# that it ends up the same. Let's color anything that starts with an upper
# case letter
uc.first.loc <- gregexpr("\\b[A-Z]\\w*\\b", y)
uc.first <- regmatches(y, uc.first.loc)
y.col <- y
regmatches(y.col, uc.first.loc) <- lapply(uc.first, crayon::red)
# and color some entire sentences
y.col[c(2, 4, 6)] <- crayon::inverse(y.col[c(2, 4, 6)])
y.paste <- paste(y.col, collapse = "\n\n")
x.paste <- crayon::strip_style(y.paste)
# writeLines(ansi_strwrap(y.paste, width = 60))
starts <- 0:20 * 60 + 1
stops <- 1:21 * 60
y.no.nl <- gsub('\n', ' ', y.paste)
y.rep <- rep(y.no.nl, length(starts))
x.rep <- crayon::strip_style(y.rep)
microbenchmark(times=10,
xx <- ansi_substr(y.rep, starts, stops),
yy <- ansi_substr2(y.rep, starts, stops),
zz <- substr(x.rep, starts, stops)
)
## Unit: microseconds
## expr min lq mean
## xx <- ansi_substr(y.rep, starts, stops) 32003.939 34160.378 35565.5400
## yy <- ansi_substr2(y.rep, starts, stops) 188.053 191.396 225.5647
## zz <- substr(x.rep, starts, stops) 24.413 24.661 27.9602
## median uq max neval
## 35413.8725 37561.690 38766.650 10
## 232.3210 252.102 267.456 10
## 26.3995 28.975 41.744 10
and while the code isn't ready for primetime on it's face it does more or less what it is supposed to:
Obviously not exactly right but close enough as a proof of concept. Additionally, this has the benefit of easily handling ANSI tags of the form 033[31;47m
, mismatched tags, etc., by parsing the strings in a way more akin to terminals do (i.e. read; hit tag; change state; read; hit tag; change state; etc.).
Let me know if you're interested in this and I can pursue further, cleanup, etc.
Side note, are you up for making this repo public? I realize it's not even close to ready for general deployment, but I would prefer to contribute to a public repo rather than a private one.
I think this is promising! I would prefer a more generic implementation, though, either based on regular expressions or some more principled parser. That is easier to handle I think.
Probably would go with "some more principled parser" here, mostly because the regexes are part of the problem here (slowness, tags that aren't closed, etc.). That said, I'm not entirely sure what you mean by that so it would be useful to clearly outline what the parsing code is expected to do.
Also, as a side note, I'm working on a couple of other projects right now so I probably won't be able to do anything here in the next month or two.
For a parser, I would probably define a grammar for it. But for now I think we can go with the regular expressions, hopefully this part can be sped up later without affecting the rest of the codebase.
Do you mean a human readable representation of the grammar in the sources (e.g. as is the case with the regex tags)?
Otherwise the grammar I (in theory) implemented is the CSI code grammar as defined on wikipedia. This isn't obvious by looking at the code but it should be obvious from (as of yet non-existent) tests whether it is implemented correctly or not. Are you proposing defining a different implementation of the grammar? Or again is your concern that the grammar is not discernible from visual inspection of the code?
Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m"
. This goes against my earlier comment in #6, but since it actually is perfectly valid ANSI CSI escape sequence grammar with the middle tag acting both as an opening and closing tag, I'm beginning to change my mind on that. This would have implications for how we structure things. We can still use regex for now, but the semantics of the grammar do not naturally align with a "open|close" data structure.
Do you mean a human readable representation of the grammar in the sources (e.g. as is the case with the regex tags)?
Yeah, a bison/yacc grammar.
Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m".
I think we support that. We only don't support the multiple commands in the same escape sequence case, I think. But yes, we can support that as well, later.
Note, made new issue for the parsing discussion (#10).
Also, as an aside that may affect how things get implemented, I think we should consider supporting things such as "\033[31mhello\033[32mgoodbye\033[0m".
I think we support that. We only don't support the multiple commands in the same escape sequence case, I think. But yes, we can support that as well, later.
Ok, so it's just the segfault issue in #6. Without looking into the details of the source it seemed that the structure returned by make_ansi_map
might not support it, though it certainly could by repeating a tag. Couldn't actually check b/c of segfault.
One early concern I have based on my strwrap implemention is performance. Consider:
Making the ansi map for each string independently looks to be expensive due to the required re_exec call. Let me know if I'm just doing something wrong.
Assuming I didn't mess things up and the stuff I think is slow is actually slow, I'm thinking that we could make the ansi maps without using re_exec at all since what we're trying to match are fixed strings so it should be okay to just do in C by reading one char at a time looking for tags (I've never tried to read UTF8 string in C so I don't know if this actually gets really complicated). This will also allow us to build the ANSI state in an as-read basis, which is the same as the terminal. Also, dealing with things like
\033[1;33;41;5m
becomes pretty straightforward.