Closed AugustT closed 2 years ago
@AugustT What OS are you using?
I have made changes to address this @AugustT. Could you try it out and give feedback?
On Windows I get the following using this code, but I am wondering what Mac users will get:
gsub_reg <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_perl <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)
library(microbenchmark)
library(sentimentr)
y <- hotel_reviews$text
r <- microbenchmark::microbenchmark(
gsub_reg = gsub_reg(y),
gsub_perl = gsub_perl(y),
times = 100
)
plot(r)
Similar results from Mac:
I get similar results on Windows with \\p{L} swapped for [:alpha:]:
And with \\p{L} swapped for [:alpha:] on Mac:
Great work. Not that it matters, but I was on Linux and Windows. I think your benchmarking negates the need for me to test. Great stuff! FYI @GitTFJ
One final related benchmark:
gsub_reg_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_reg_a <- function(x) gsub("\\d:\\d|\\d\\s|[^[:alpha:]',;: ]", '<<>>', x)
gsub_perl_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)
library(microbenchmark)
library(sentimentr)
y <- hotel_reviews$text
r <- microbenchmark::microbenchmark(
gsub_reg_p = gsub_reg_p(y),
gsub_reg_alpha = gsub_reg_a(y),
gsub_perl_p = gsub_perl_p(y),
times = 100
)
plot(r)
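For anyone without microbenchmark or sentimentr installed, a rough cross-check of the engine comparison is possible with base R alone. This is only a sketch: the synthetic `y` is a stand-in for hotel_reviews$text, `time_engine` is a hypothetical helper, and system.time is much coarser than microbenchmark, so the numbers are indicative at best.

```r
set.seed(1)

# Synthetic stand-in for hotel_reviews$text (assumption: 2000 strings of
# 200 random letters, digits, spaces and punctuation).
chars <- c(letters, 0:9, " ", ".", ":")
y <- vapply(
  seq_len(2000),
  function(i) paste(sample(chars, 200, replace = TRUE), collapse = ""),
  character(1)
)

# Same pattern as gsub_reg_a above; [:alpha:] is accepted by both engines.
pattern <- "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]"

# Hypothetical helper: elapsed seconds for `reps` passes with one engine.
time_engine <- function(perl, reps = 5) {
  system.time(
    for (i in seq_len(reps)) gsub(pattern, "<<>>", y, perl = perl)
  )[["elapsed"]]
}

timings <- c(default = time_engine(FALSE), perl = time_engine(TRUE))
print(timings)
```

Timings will vary by OS, R version, and locale, which is the reason for collecting them on more than one platform.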
x <- c("Jeg hader dårlige mennesker.", "Jeg kan godt lide lækker is.",
       "Jeg elsker dig ikke mere; undskyld.")
Inside of make_sentence_df2 I do not use perl = TRUE (see https://github.com/trinker/sentimentr/commit/d24e6af7789c6185ac01b64cdcfcf39532917ea3) because in this case [:alpha:] is faster than \\p{L}, and setting perl = TRUE for [:alpha:] does not retain what we want in the alphabetic non-ASCII chars:
gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x)
## [1] "Jeg hader dårlige mennesker." "Jeg kan godt lide lækker is."
## [3] "Jeg elsker dig ikke mere; undskyld."
gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x, perl = TRUE)
## [1] "Jeg hader dårlige mennesker." "Jeg kan godt lide lækker is."
## [3] "Jeg elsker dig ikke mere; undskyld."
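To check the engine difference on a given machine, something like the following sketch may help. It uses the `[^[:alpha:]',;: ]` spelling from the benchmark functions and writes the Danish letters as \u escapes so the strings are marked UTF-8 whatever the file encoding is. How [:alpha:] treats å and æ depends on the engine and, for the default engine, on the locale, so no expected output is shown for those cases.

```r
# Danish example sentences, with \u escapes for the non-ASCII letters.
x <- c("Jeg hader d\u00e5rlige mennesker.",
       "Jeg kan godt lide l\u00e6kker is.",
       "Jeg elsker dig ikke mere; undskyld.")

pattern_alpha <- "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]"
pattern_prop  <- "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"

# [:alpha:] with the default engine and with PCRE.
alpha_default <- gsub(pattern_alpha, "<<>>", x)
alpha_perl    <- gsub(pattern_alpha, "<<>>", x, perl = TRUE)

# \p{L} is a Unicode property class, used here with PCRE only.
prop_perl <- gsub(pattern_prop, "<<>>", x, perl = TRUE)

print(alpha_default)
print(alpha_perl)
print(prop_perl)

# Elements where the two engines disagree for [:alpha:], if any.
print(which(alpha_default != alpha_perl))
```

The third sentence is pure ASCII, so both engines should agree on it; any disagreement should be confined to the sentences containing å or æ.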
https://github.com/trinker/sentimentr/blob/81224ecf884e9406dad1bebf90914a19b7ee0373/R/utils.R#L182
I'm doing a big analysis and it's taking a while. I did some exploring, and it turns out that my workflow (which includes processes outside this package) spends half of its time in .mygsub. Can you add perl = TRUE to all calls to gsub? Can you replace loops with applies? I could have a look into doing a PR myself, but I'd like to hear your thoughts first.
@trinker @GitTFJ
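As a rough illustration of the request, here is a sketch of a multi-pattern substitution helper with a switchable engine. `multi_gsub` is hypothetical and is not sentimentr's .mygsub; the patterns and sample reviews are made up for the example.

```r
# Hypothetical helper, NOT sentimentr's .mygsub: applies pattern/replacement
# pairs one after another and lets the caller pick the regex engine.
multi_gsub <- function(patterns, replacements, x, perl = FALSE) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    # gsub is already vectorised over x; only the patterns are iterated.
    x <- gsub(patterns[i], replacements[i], x, perl = perl)
  }
  x
}

# Made-up sample data for illustration.
reviews      <- c("Room 12 was clean.", "Check-in at 3:30 was slow!")
patterns     <- c("\\d:\\d", "[!.]")
replacements <- c("<<time>>", "<<end>>")

out_default <- multi_gsub(patterns, replacements, reviews)
out_perl    <- multi_gsub(patterns, replacements, reviews, perl = TRUE)

print(out_default)
print(identical(out_default, out_perl))
```

In a helper shaped like this, each substitution works on the output of the previous one, so the loop over patterns is sequential by nature and an apply would not change the amount of work. Any speed-up would more plausibly come from the engine choice, which is what the benchmarks in this thread measure.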