Closed JuKo007 closed 2 years ago
Thank you for the report. This is a known limitation: the default regex is fairly lightweight by design.
> qdapRegex::grab('@rm_url')
[1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
The default assumes that the character sequence ftp will rarely appear outside a URL. If you want something more robust, I'd suggest using pattern = "@rm_url2", as seen below:
rm_url("this is 99 according to craftprotocoll",extract=TRUE, pattern = "@rm_url2")
## [[1]]
## [1] NA
The trade-off is that this regex will likely be too strict.
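To make the trade-off concrete, here is a minimal sketch in base R that applies the default pattern printed above to the reported string, next to a middle-ground pattern. The `stricter_url` pattern and the `extract_all` helper are assumptions made up for illustration; they are not part of qdapRegex, and the sketch assumes rm_url() applies its pattern without further preprocessing.

```r
# Default pattern, copied from the qdapRegex::grab('@rm_url') output above.
default_url <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# Hypothetical middle-ground pattern (an assumption, not shipped with qdapRegex):
# require "://" after the scheme, or a word boundary before "www.".
stricter_url <- "((https?|ftp)://[^ ]+)|(\\bwww\\.[^ ]+)"

# Hypothetical helper: return every match of `pattern` in a single string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "this is 99% according to craftprotocoll"
extract_all(x, default_url)   # "ftprotocoll": "ftp" is matched in the middle of a word
extract_all(x, stricter_url)  # character(0): no scheme separator, so no match

y <- "see ftp://example.com/file or www.example.com"
extract_all(y, stricter_url)  # "ftp://example.com/file" "www.example.com"
```

The default pattern has no anchor before "ftp", so any word containing those three letters starts a match that runs to the next space. Requiring "://" removes that false positive, at the cost of missing scheme-less URLs that do not begin with "www.".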
I recently started working with a text dataset and realized that rm_url() is extracting a non-URL pattern from the texts. Here's an example:
rm_url("this is 99% according to craftprotocoll",extract=TRUE)
which extracts
"ftprotocoll"
I did some manual testing, and the cause seems to be the combination of a percent sign (%) followed by the character string "ftp". I'm not sure whether this is a known issue, but I thought I'd let you know.
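One quick way to test the percent-sign hypothesis is to run the default pattern over variants of the sentence in base R. This is only a sketch: it uses the pattern string that qdapRegex::grab('@rm_url') prints in this thread and assumes rm_url() applies it as is; `find_matches` is a hypothetical helper name.

```r
# Default pattern as printed by qdapRegex::grab('@rm_url') in this thread.
default_url <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# Hypothetical helper: all matches of the default pattern in one string.
find_matches <- function(x) {
  regmatches(x, gregexpr(default_url, x, perl = TRUE))[[1]]
}

with_percent    <- find_matches("this is 99% according to craftprotocoll")
without_percent <- find_matches("this is 99 according to craftprotocoll")
percent_only    <- find_matches("this is 99% according to the protocol")

with_percent     # "ftprotocoll"
without_percent  # "ftprotocoll"
percent_only     # character(0)
```

Under these assumptions the match depends only on the letters "ftp" appearing inside "craftprotocoll"; the percent sign makes no difference.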