rm_url() extracts non link strings

trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.


Thank you for the report. This is a known behavior: the default regex is fairly lightweight by design.

> qdapRegex::grab('@rm_url')
[1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
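To see where the false positive comes from, the match can be reproduced with base R alone. This is only a sketch: it applies the pattern printed above through regmatches()/gregexpr() rather than through rm_url() itself, and the variable names are illustrative.

```r
# Default pattern exactly as printed by grab('@rm_url')
default_pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# String from the report; it contains no URL
text <- "this is 99 according to craftprotocoll"

# Collect every substring the pattern matches
hits <- regmatches(text, gregexpr(default_pattern, text))[[1]]
print(hits)
## [1] "ftprotocoll"
```

The second alternative, ftp[^ ]*, is not anchored to the start of a word, so it matches the "ftp" inside "craftprotocoll" and runs to the next space.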

The default assumes that the character sequence ftp will rarely appear inside ordinary words; in your example it occurs in the middle of "craftprotocoll". If you want something more robust, I'd suggest using pattern = "@rm_url2" as seen below:

rm_url("this is 99 according to craftprotocoll", extract = TRUE, pattern = "@rm_url2")
## [[1]]
## [1] NA

The trade-off is that this regex will likely be too strict, so it may miss some strings that really are links.
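If neither built-in pattern fits, one possible middle ground is a custom pattern that only accepts a scheme at a word boundary. The sketch below is hypothetical: middle_pattern is not part of qdapRegex, and it is tested with base R rather than with rm_url(), so treat it as a starting point rather than a recommended default.

```r
# Hypothetical pattern (not shipped with qdapRegex): require a word
# boundary before the scheme, and "://" after http/https/ftp
middle_pattern <- "\\b(https?|ftp)://[^ ]+|\\bwww\\.[^ ]+"

# Small helper (illustrative name) returning all matches in a string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# The string from the report no longer produces a match
no_hits <- extract_all("this is 99 according to craftprotocoll", middle_pattern)

# Ordinary links are still picked up
url_hits <- extract_all("see http://example.com and www.example.org now", middle_pattern)

print(no_hits)
print(url_hits)
```

This still stops at the first space, so trailing punctuation such as a closing parenthesis or full stop would be included in a match.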

rm_url() extracts non link strings #34