Closed MichaelChirico closed 1 year ago
That looks like a bug, no? The control character is not embedded in the source code AFAICT, so it should be kept as "\1", I think.
Might be a utils::getParseData() bug?
utils::getParseData(parse(text = '"\\1"'))
# line1 col1 line2 col2 id parent token terminal text
# 1 1 1 1 4 1 3 STR_CONST TRUE "\\"
# 3 1 1 1 4 3 0 expr FALSE
Yeah, it is. Seems like the three-digit version always works:
❯ utils::getParseData(parse(text = '"\\01"'))
line1 col1 line2 col2 id parent token terminal text
1 1 1 1 5 1 3 STR_CONST TRUE "\\0"
3 1 1 1 5 3 0 expr FALSE
❯ utils::getParseData(parse(text = '"\\001"'))
line1 col1 line2 col2 id parent token terminal text
1 1 1 1 6 1 3 STR_CONST TRUE "\\001"
3 1 1 1 6 3 0 expr FALSE
Do you want to report it? Should we try to work around it in xmlparsedata?
I'll report it... not sure the full scope of the problem... but it seems to affect all octal strings:
for (width in 1:3) {
  for (digit in 1:7)
    cat(nchar(utils::getParseData(parse(text = sprintf('"\\%0*d"', width, digit)))$text[1L]))
  cat("\n")
}
# 3333333
# 4444444
# 6666666
The three-digit ones look correct, but the one- and two-digit ones are truncated: each is one character shorter than its source text.
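For context, R reads a backslash followed by one to three octal digits as a single character, so the three spellings above denote the same one-character string even though their source text differs in length. A quick base-R check (the variable names are only for illustration):

```r
# One-, two- and three-digit octal escapes all denote the character with code 1.
one   <- "\1"
two   <- "\01"
three <- "\001"

identical(one, three)  # the parsed values are the same
identical(two, three)
nchar(one)             # a single character, whatever the spelling
utf8ToInt(one)         # its code point
```

This is why the parsed value is not affected; only the text column of the parse data, which should echo the source spelling, is.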
It is probably possible to work around this in xmlparsedata, if we wanted to. We would need to look for STR_CONST rows where nchar(text) is shorter than col2 - col1 + 1. Then we would need to re-parse these strings with some custom parser.
Let me know if you need the workaround for this. A PR is also welcome. (But no pressure, really.)
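A minimal sketch of the detection step described above, not actual xmlparsedata code. The helper name find_truncated_str_const() is hypothetical, and the sketch assumes single-line strings whose col1/col2 count characters the same way nchar() does (no tabs, no multibyte characters):

```r
# Hypothetical helper: return the STR_CONST rows whose text is shorter than
# the span that col1/col2 say the token occupies in the source.
find_truncated_str_const <- function(pd) {
  is_str <- pd$token == "STR_CONST"
  # col2 - col1 + 1 is only a meaningful width for tokens on a single line.
  single_line <- pd$line1 == pd$line2
  expected <- pd$col2 - pd$col1 + 1L
  pd[is_str & single_line & nchar(pd$text) < expected, , drop = FALSE]
}

# Hand-built parse data mimicking the truncated row reported in this thread.
pd <- data.frame(
  line1 = c(1L, 1L), col1 = c(1L, 1L),
  line2 = c(1L, 1L), col2 = c(4L, 4L),
  id = c(1L, 3L), parent = c(3L, 0L),
  token = c("STR_CONST", "expr"),
  terminal = c(TRUE, FALSE),
  text = c('"\\"', ""),  # three characters, but the columns span four
  stringsAsFactors = FALSE
)
find_truncated_str_const(pd)
```

The example uses a hand-built data frame rather than a live getParseData() call so that it behaves the same on R versions with and without the bug.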
Then we would need to re-parse these strings with some custom parser.
This is the headache part... not sure how worth it it is to maintain that. Though I guess it's as "easy" as starting from col1 and finding the balanced quote. Modulo R"()" strings...
No need to find quotes, we only "re-parse" the STR_CONST, at the known line and col coordinates. If we find a \ followed by one or two digits, then we fix it. (And on R versions that don't have this bug, we do not work around at all.)
Still, it is not trivial at all, and it might not be worth it.
It looks like the fix for R itself is two lines, though I feel like I've just pulled out some Jenga blocks and am waiting for it to crash:
No need to find quotes, we only "re-parse" the STR_CONST, at the known line and col coordinates. If we find a \ followed by one or two digits, then we fix it. (And on R versions that don't have this bug, we do not work around at all.)
I think not quite, because the \ can come anywhere in the string, e.g. for
"ab\1"
"a\1b"
"\1ab"
The STR_CONST data will all be the same. So we do have to parse the string to find the guilty characters.
I also don't know if there's anything to be done in a case like xml_parse_data(utils::getParseData(parse(text = "'\\1'"))) where srcref is not available -- that would be most relevant to lintr IINM.
Hmm, OTOH, I don't think it's possible to declare an octal-escaped character in raw string mode, so maybe it's still doable:
R"(\1)"
# [1] "\\1"
First of all, I think raw strings can be ignored, they are fine. So if text starts with R then we have nothing to do.
Otherwise, for each STR_CONST we check if the length of text is fine, and if it is not, then we just take the text from the input at the specified positions. text is the raw input, not the parsed bytes:
❯ utils::getParseData(parse(text = "'\\u00a0'"))$text[[1]]
[1] "'\\u00a0'"
❯ charToRaw(utils::getParseData(parse(text = "'\\u00a0'"))$text[[1]])
[1] 27 5c 75 30 30 61 30 27
So, unless I am mistaken, this is simple?
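A rough sketch of that idea, assuming the source lines are available as a character vector. restore_str_const_text() and src_lines are hypothetical names, not part of xmlparsedata, and the sketch assumes col1/col2 index directly into the source line (single-line strings, no tabs, no multibyte characters):

```r
# Hypothetical helper: for non-raw, single-line STR_CONST rows whose text is
# shorter than their column span, copy the text back from the source line.
restore_str_const_text <- function(pd, src_lines) {
  expected <- pd$col2 - pd$col1 + 1L
  is_raw <- startsWith(pd$text, "R") | startsWith(pd$text, "r")
  fix <- pd$token == "STR_CONST" &
    pd$line1 == pd$line2 &
    !is_raw &
    nchar(pd$text) < expected
  # Take the whole token from the source, so it does not matter where in the
  # string the escape sits.
  pd$text[fix] <- substring(src_lines[pd$line1[fix]], pd$col1[fix], pd$col2[fix])
  pd
}

# Hand-built example: the source line is the four characters "\1",
# and the parse data carries the truncated three-character text.
src_lines <- '"\\1"'
pd <- data.frame(
  line1 = 1L, col1 = 1L, line2 = 1L, col2 = 4L,
  token = "STR_CONST", text = '"\\"',
  stringsAsFactors = FALSE
)
restore_str_const_text(pd, src_lines)$text
```

Rows whose text already has the expected length are left untouched, so on R versions without the bug the function would be a no-op.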
I think we came to the same conclusion, see PR... now my concern is how this mixes with \t because of https://bugs.r-project.org/show_bug.cgi?id=18114 :monocle_face:
Oh, wow. I didn't know that tabs change the positions, what is the point of that? Yeah, that can be a problem.
We should probably do the same thing as lintr? Does lintr call xmlparsedata with the un-TAB-ified parse data? (In that case lintr would also need the fix for \1.)
I didn't know that tabs change the positions, what is the point of that?
I guess it imitates how tabs usually work... "bump position to next tab marker". What's weird is that each tab stop has width 8, which is somewhat insane IMO.
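To make the "bump to next tab marker" behaviour concrete, here is a small sketch of how columns would come out if a tab advances the position to the next multiple of 8. This follows the description in this thread rather than the R sources, and tab_expanded_cols() is a hypothetical helper:

```r
# Hypothetical helper: the column reported for each character of `line`,
# assuming a tab moves the position to the next multiple of `tab_width`.
tab_expanded_cols <- function(line, tab_width = 8L) {
  chars <- strsplit(line, "", fixed = TRUE)[[1L]]
  cols <- integer(length(chars))
  col <- 0L
  for (i in seq_along(chars)) {
    if (chars[i] == "\t") {
      # Jump to the next tab stop.
      col <- (col %/% tab_width + 1L) * tab_width
    } else {
      col <- col + 1L
    }
    cols[i] <- col
  }
  cols
}

tab_expanded_cols("\tx")    # the tab ends at column 8, so x lands on column 9
tab_expanded_cols("ab\tc")  # a tab after two characters also ends at column 8
```

Under this assumption a character's reported column and its index in the source line diverge as soon as a tab precedes it, which is what makes position-based text extraction fragile.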
We should probably do the same thing as lintr? Does lintr call xmlparsedata with the un-TAB-ified parse data? (In that case lintr would also need the fix for \1.)
Looking into this now, I don't know off-hand.
lintr applies fix_tab_indentations() and fix_eq_assigns() to the parse data before passing it to xml_parse_data():
We could just copy those over here (both are MIT licensed), WDYT?
We are writing a linter to find non-regex regexes, which involves examining the STR_CONST passed to functions like gsub(), strsplit(), etc.: https://github.com/r-lib/lintr/pull/1032
The linter threw an error in a rare usage here:
https://github.com/ggrothendieck/gsubfn/blob/66d52d9c0db277d39d3502d1db74911e8257fb64/R/gsubfn.R#L163
Where the relevant R code is:
the character literal here shows up in the XML parse data as:
My guess is this is similar to #19... is there anything to do here besides resort to using the R code itself?