trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

rm_url() extracts non link strings #34

Closed by JuKo007 2 years ago

JuKo007 commented 2 years ago

I recently started working with a text dataset and noticed that rm_url() extracts a non-URL pattern from the texts. Here's an example:

rm_url("this is 99% according to craftprotocoll", extract = TRUE)

which extracts

"ftprotocoll"

I did some manual testing and it seems to be the combination of a percent sign (%) followed by the character string "ftp" that causes this. I'm not sure whether this is a known issue, but I thought I'd let you know.

trinker commented 2 years ago

Thank you for the report. This is a known limitation: the default regex is fairly lightweight by design.

> qdapRegex::grab('@rm_url')
[1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
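To see why this pattern picks up "ftprotocoll", here is a minimal base R sketch that applies the same regex string with gregexpr() and regmatches() instead of going through rm_url(). The helper name extract_matches is hypothetical and only used for illustration. In this reproduction the ftp branch matches anywhere in the string, including the middle of a word, and the result is the same with or without the percent sign.

```r
# Default pattern, copied from the grab('@rm_url') output above.
default_pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# Hypothetical helper (not part of qdapRegex): return every match of
# `pattern` in a single string, using base R only.
extract_matches <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# The "ftp" inside "craftprotocoll" starts a match that runs to the next space
# (or the end of the string), regardless of the "%" earlier in the sentence.
with_percent    <- extract_matches("this is 99% according to craftprotocoll", default_pattern)
without_percent <- extract_matches("this is 99 according to craftprotocoll", default_pattern)

print(with_percent)
print(without_percent)
```

Both calls return "ftprotocoll", which matches the behaviour reported in the issue.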

The default assumes that the character sequence ftp will rarely appear inside ordinary text. If you want something more robust, I'd suggest using pattern = "@rm_url2", as seen below:

rm_url("this is 99 according to craftprotocoll", extract = TRUE, pattern = "@rm_url2")
## [[1]]
## [1] NA

The trade-off is that this regex is likely to be too strict.
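If neither pattern fits, one possible middle ground is to keep the lightweight regex but require a word boundary before each prefix. The sketch below is an assumption for illustration, not a pattern shipped with qdapRegex: anchored_pattern and extract_matches are hypothetical names, and the example URLs are placeholders. It is tested here with base R only.

```r
# Hypothetical variant of the lightweight pattern: "\\b" requires that
# http, ftp, or www. begin at a word boundary rather than mid-word.
anchored_pattern <- "(\\bhttp[^ ]*)|(\\bftp[^ ]*)|(\\bwww\\.[^ ]*)"

# Hypothetical helper (not part of qdapRegex): return every match of
# `pattern` in a single string, using base R only.
extract_matches <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# "ftp" inside "craftprotocoll" is preceded by a letter, so there is no
# word boundary and nothing is extracted.
no_match <- extract_matches("this is 99% according to craftprotocoll", anchored_pattern)

# Prefixes that start a token are still picked up.
ftp_match <- extract_matches("see ftp://example.com/file for details", anchored_pattern)
www_match <- extract_matches("visit www.example.com today", anchored_pattern)

print(no_match)
print(ftp_match)
print(www_match)
```

This variant still matches any token that merely begins with one of the prefixes (for example a word such as "httpd"), so it reduces false positives rather than eliminating them.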