trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

Incomplete trimming and cleaning of hashtags #14

Closed mobcdi closed 9 years ago

mobcdi commented 9 years ago

When you include the arguments clean=TRUE, trim=TRUE, it's possible to get a result that still includes more than one space between tags.

Tweet ID: 183298311552372737

Tweet text: Here come our next bunch! Hope we have some brave, chocolate-loving volunteers in this group! #KBSUL #OpenDays2013 http://t.co/7NmtKgWHr7

Function call: hashtags <- rm_hash(dfWithWeekday$text, extract=TRUE, clean=TRUE, trim=TRUE)

Output: [1] "#KBSUL" "#OpenDays2013"

Also, if you pass the output of rm_hash to stringr's str_trim, you get a single deparsed string back. The R code hashtags <- str_trim(hashtags, c("both")) gives you "c(\"#KBSUL\", \"#OpenDays2013\")"

trinker commented 9 years ago

Thanks for your question. Can I ask that when you share code you use GitHub's syntax markup? It also helps to make your example reproducible so that others can just run your code. Here's a link describing this: https://help.github.com/articles/github-flavored-markdown/

This happens because functions in qdapRegex return a list when extract = TRUE, since each string may have several extractions (as with hashtags). Use str to see what type of object is returned.

str(hashtags)

str_trim works on a vector, not a list, so you'd have to unlist first or call it inside lapply, as shown below. The first approach returns a vector but destroys the list structure. The second returns a list. There should also be no need for str_trim, as qdapRegex already trims the result: the output is the same before and after.

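x <- "Tweet text: Here come our next bunch! Hope we have some brave, chocolate-loving volunteers in this group! #KBSUL #OpenDays2013 http://t.co/7NmtKgWHr7"
# extract = TRUE returns a list with one element per input string
(hashtags <- rm_hash(x, extract=TRUE, clean=TRUE, trim=TRUE))

# Option 1: flatten to a character vector (loses the per-string grouping)
(hashtags_vec <- stringr::str_trim(unlist(hashtags), side = "both"))
# Option 2: trim each list element and keep the list structure
(hashtags_list <- lapply(hashtags, stringr::str_trim, side = "both"))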
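
To illustrate the list-versus-vector point without depending on qdapRegex or stringr, here is a minimal base R sketch. The list is built by hand as a stand-in for what rm_hash returns with extract = TRUE (an assumption for illustration), and trimws stands in for str_trim. It shows that coercing a list to character deparses each element into one string, which is presumably what produced the "c(...)" output reported above.

```r
# Hand-built stand-in for the list that rm_hash(..., extract = TRUE) returns
# (an assumption for illustration; no qdapRegex call is made here).
hashtags <- list(c("#KBSUL", "#OpenDays2013"))

# Coercing the list to character deparses each element into a single string,
# which matches the shape of the output reported in the question.
coerced <- as.character(hashtags)
print(coerced)

# Flattening first gives one string per hashtag, but drops the grouping.
flat <- trimws(unlist(hashtags), which = "both")
print(flat)

# Trimming element by element keeps one list entry per input string.
kept <- lapply(hashtags, trimws, which = "both")
str(kept)
```

Which form to prefer depends on whether you still need to know which hashtags came from which tweet: the lapply form keeps that mapping, the unlist form does not.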

mobcdi commented 9 years ago

Sorry, I forgot to mark it up but have done it now. Thanks for helping out so quickly.