r-lib / xmlparsedata

R code parse data as an XML tree
https://r-lib.github.io/xmlparsedata/
Other
23 stars 7 forks source link

"\1" makes unparseable string in STR_CONST #25

Closed MichaelChirico closed 1 year ago

MichaelChirico commented 2 years ago

We are writing a linter to find non-regex regexes which involves examining the STR_CONST passed to functions like gsub(), strsplit(), etc:

https://github.com/r-lib/lintr/pull/1032

The linter threw an error in a rare usage here:

https://github.com/ggrothendieck/gsubfn/blob/66d52d9c0db277d39d3502d1db74911e8257fb64/R/gsubfn.R#L163

Where the relevant R code is:

strsplit(z, "\1")

the character literal here shows up in the XML parse data as:

<STR_CONST line1="1" col1="13" line2="1" col2="16" start="31" end="34">"\"</STR_CONST>

My guess is this is similar to #19... is there anything to do here besides resort to using the R code itself?

gaborcsardi commented 2 years ago

That looks like a bug, no? The control character is not embedded in the source code AFAICT, so it should be kept as "\1" I think.

MichaelChirico commented 2 years ago

Might be a utils::getParseData() bug?

utils::getParseData(parse(text = '"\\1"'))
#   line1 col1 line2 col2 id parent     token terminal text
# 1     1    1     1    4  1      3 STR_CONST     TRUE "\\"
# 3     1    1     1    4  3      0      expr    FALSE     
gaborcsardi commented 2 years ago

Yeah, it is. Seems like the three-digit version always works:

❯ utils::getParseData(parse(text = '"\\01"'))
  line1 col1 line2 col2 id parent     token terminal  text
1     1    1     1    5  1      3 STR_CONST     TRUE "\\0"
3     1    1     1    5  3      0      expr    FALSE

❯ utils::getParseData(parse(text = '"\\001"'))
  line1 col1 line2 col2 id parent     token terminal    text
1     1    1     1    6  1      3 STR_CONST     TRUE "\\001"
3     1    1     1    6  3      0      expr    FALSE

Do you want to report it? Should we try to work around it in xmlparsedata?

MichaelChirico commented 2 years ago

I'll report it... not sure the full scope of the problem... but it seems to affect all octal strings:

for (width in 1:3) {
for (digit in 1:7)
cat(nchar(utils::getParseData(parse(text = sprintf('"\\%0*d"', width, digit)))$text[1L]))
cat("\n")
}
# 3333333
# 4444444
# 6666666
gaborcsardi commented 2 years ago

The three digit ones look correct, but the one- and two-digit ones are truncated.

gaborcsardi commented 2 years ago

It is probably possible to work around this in xmlparsedata, if we wanted to. We would need to look for STR_CONST rows where nchar(text) is shorter than col2 - col1 + 1. Then we would need to re-parse these strings with some custom parser.

MichaelChirico commented 2 years ago

Posted: https://bugs.r-project.org/show_bug.cgi?id=18323

gaborcsardi commented 2 years ago

Let me know if you need the workaround for this. A PR is also welcome. (But no pressure, really.)

MichaelChirico commented 2 years ago

Then we would need to re-parse these strings with some custom parser.

This is the headache part... not sure how worth it it is to maintain that. Though I guess it's as "easy" as starting from col1 and finding the balanced quote. Modulo R"()" strings...

gaborcsardi commented 2 years ago

No need to find quotes, we only "re-parse" the STR_CONST, at the known line and col coordinates. If we find a \ followed by one or two digits, then we fix it. (And on R versions that don't have this bug, we do not work around at all.)

Still, it is not trivial at all, and it might not be worth it.

MichaelChirico commented 2 years ago

It looks like the fix for R itself is two lines, though I feel like I've just pulled out some Jenga blocks and am waiting for it to crash:

https://bugs.r-project.org/show_bug.cgi?id=18323

MichaelChirico commented 2 years ago

No need to find quotes, we only "re-parse" the STR_CONST, at the known line and col coordinates. If we find a \ followed by one or two digits, then we fix it. (And on R versions that don't have this bug, we do not work around at all.)

I think not quite, because the \ can come anywhere in the string, e.g. for

"ab\1"
"a\1b"
"\1ab"

The STR_CONST data will all be the same. So we do have to parse the string to find the guilty characters.

I also don't know if there's anything to be done in a case like xml_parse_data(utils::getParseData(parse(text = "'\\1'"))) where srcref is not available -- that would be most relevant to lintr IINM.

MichaelChirico commented 2 years ago

Hmm, OTOH, I don't think it's possible to declare an octal-escaped character in raw string mode, so maybe it's still doable:

R"(\1)"
# [1] "\\1"
gaborcsardi commented 2 years ago

First of all, I think raw strings can be ignored, they are fine. So if text starts with R then we have nothing to do.

Otherwise, for each STR_CONST we check if the length of text is fine, and if it is not, then we just take the text from the input at the specified positions. text is the raw input, not the parsed bytes:

❯ utils::getParseData(parse(text = "'\\u00a0'"))$text[[1]]
[1] "'\\u00a0'"

❯ charToRaw(utils::getParseData(parse(text = "'\\u00a0'"))$text[[1]])
[1] 27 5c 75 30 30 61 30 27

So, unless I am mistaken, this is simple?

MichaelChirico commented 2 years ago

I think we came to the same conclusion, see PR... now my concern is how this mixes with \t because of https://bugs.r-project.org/show_bug.cgi?id=18114 :monocle_face:

gaborcsardi commented 2 years ago

Oh, wow. I didn't know that tabs change the positions, what is the point of that? Yeah, that can be a problem.

We should probably do the same thing as lintr? Does lintr call xmlparsedata with the un-TAB-ified parse data? (In that case lintr would also need the fix for \1.)

MichaelChirico commented 2 years ago

I didn't know that tabs change the positions, what is the point of that?

I guess it imitates how tabs usually work... "bump position to next tab marker". what's weird is each tab has width 8, which is somewhat insane IMO.

We should probably do the same thing as lintr? Does lintr call xmlparsedata with the un-TAB-ified parse data? (In that case lintr would also need the fix for \1.)

Looking into this now, I don't know off-hand.

MichaelChirico commented 2 years ago

lintr applies fix_tab_indentations() and fix_eq_assigns() to the parse data before passing it to xml_parse_data():

https://github.com/r-lib/lintr/blob/49fd3468a5ec8fb14d5986809b3ca22511ab5ef7/R/get_source_expressions.R#L392

We could just copy those over here (both are MIT licensed), WDYT?