trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

`rm_between` same left right boundaries gives undesired output #13

Closed trinker closed 9 years ago

trinker commented 9 years ago

Determine whether the following is a bug and, if so, how to fix it:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

rm_between(
    x, 
    left = '"', right = '"',
    extract=TRUE
)

## [[1]]
## [1] "Salmo salar"       "and Danube salmon" "Hucho hucho"  

When we expect:

## [[1]]
## [1] "\"Salmo salar\"" "\"Hucho hucho\""
trinker commented 9 years ago

I think this is because rm_between, by default, does not include the left/right bounds in the match. It uses the following regex: "(?<=\").*?(?=\")" (S("@rm_between2", '"')). Because lookarounds do not consume the left/right bounds, the closing quotation mark of one span remains available as the opening mark of the next match: " and Danube salmon ". This is (IMO) a bug that I will address, but I am unsure how yet.
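
A minimal base R sketch of the behavior described above. It assumes only `gregexpr()` and `regmatches()` with `perl = TRUE` and does not call qdapRegex, so it illustrates the regex behavior rather than the internals of rm_between:

```r
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

# Lookarounds assert that a quote is adjacent but do not consume it,
# so each closing quote can also act as the opening quote of the next match.
lookaround <- regmatches(x, gregexpr('(?<=").*?(?=")', x, perl = TRUE))[[1]]
lookaround
```

The result has three elements, with the unquoted text " and Danube salmon " matched between the two quoted names.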

trinker commented 9 years ago

@hwnd you suggested:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

rm_default(
    x, 
    pattern = '(?<=")[^"]*',
    extract=TRUE
)

But this gives:

## [[1]]
## [1] "Salmo salar"         " and Danube salmon " "Hucho hucho"         ""

Instead of the expected:

## [[1]]
## [1] "Salmo salar" "Hucho hucho"
hwndx commented 9 years ago

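When the left and right boundaries are the same character, as with quotes, lookarounds should be avoided: they leave the quotes unconsumed, so the text "in between" two quoted spans matches as well.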

One possible workaround would be:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

gsub('^"|"$', '', 
   rm_default(
       x, 
       pattern = '"[^"]*"', 
       extract=TRUE)[[1]]
   )

Output:

## [1] "Salmo salar" "Hucho hucho"
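
The same consume-then-strip idea can be sketched in base R without qdapRegex. The helper name `extract_quoted` is hypothetical and exists only for this illustration; it is not the code that was added to rm_between:

```r
# Hypothetical helper (not part of qdapRegex): extract text between double quotes.
extract_quoted <- function(x) {
    # Consume both quotes so a closing quote cannot open the next match.
    m <- regmatches(x, gregexpr('"[^"]*"', x, perl = TRUE))
    # Strip the consumed leading and trailing quote from each match.
    lapply(m, function(s) gsub('^"|"$', '', s))
}

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
extract_quoted(x)
```

Because the pattern consumes the delimiters, matching resumes after each closing quote, and only the two quoted names are returned.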
trinker commented 9 years ago

@hwndx I incorporated your idea into rm_between. Thanks for the help.