hadley / r4ds

R for data science: a book
http://r4ds.hadley.nz

When a literal ] appears in square brackets in a regular expression, base R functions find nothing within the range unless perl=TRUE (R for Data Science could mention this) #1629

Open markpurver opened 9 months ago

Section 15.4.3 of R for Data Science (https://r4ds.hadley.nz/regexps.html#character-classes) says this about regular expressions: "\ escapes special characters, so [\^\-\]] matches ^, -, or ]." This specific example does not seem to hold in base R unless perl=TRUE is set (I am using R 4.2.1). Section 15.7.2 notes that base R and stringr differ slightly in general, but this particular quirk may be worth mentioning in 15.4.3, since the example there runs into one of those differences.

For example, grepl("[\\^\\-\\]]", "]") returns FALSE, and grepl("[\\^\\-\\]]", "^-]") also returns FALSE, indicating that none of the characters in the class is found in the string. Only the ] symbol appears to cause this: grepl("[\\^\\-\\[]", "^-]") returns TRUE, seemingly because the ] is no longer in the class (in this example it has been replaced by [, but it could just as well be removed).

This issue seems to go away entirely when perl=TRUE is used: grepl("[\\^\\-\\]]", "]", perl=TRUE) and grepl("[\\^\\-\\]]", "-", perl=TRUE) both return TRUE.
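
For convenience, the calls above can be collected into one snippet that runs in a fresh base R session. This is only a sketch: the variable name `pattern` is mine, and the results noted in the comments are the ones reported in this issue for R 4.2.1, not output checked across R versions.

```r
# The book's regex [\^\-\]] written as an R string literal.
pattern <- "[\\^\\-\\]]"

# Default engine (TRE): both calls are reported above as FALSE.
grepl(pattern, "]")
grepl(pattern, "^-]")

# PCRE engine: both calls are reported above as TRUE.
grepl(pattern, "]", perl = TRUE)
grepl(pattern, "-", perl = TRUE)
```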

Perhaps the book could include a note about this, or perhaps it is an issue with base R or the TRE engine.
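
One possible workaround, if a note is added, is to avoid backslashes inside the class altogether and rely on position instead. This sketch assumes the placement rules described in ?regex (a literal ] goes first, a literal ^ anywhere but first, a literal - first or last). The names `portable` and `inputs` are only illustrative, and I have not checked how stringr treats this form.

```r
# Positional form of the class: ] first, ^ not first, - last.
portable <- "[]^-]"
inputs <- c("]", "^", "-", "a")

# Default engine (TRE): matches the three symbols but not "a".
grepl(portable, inputs)

# PCRE engine: same result.
grepl(portable, inputs, perl = TRUE)
```

Both calls should return TRUE TRUE TRUE FALSE, so this spelling does not depend on which engine is chosen.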