trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

.mygsub is slow #124

Closed by AugustT 2 years ago

AugustT commented 3 years ago

https://github.com/trinker/sentimentr/blob/81224ecf884e9406dad1bebf90914a19b7ee0373/R/utils.R#L182

I'm doing a big analysis and it's taking a while. I did some exploring, and it turns out that my workflow (which includes processes outside this package) spends half of its time in .mygsub. Can you add perl = TRUE to all calls to gsub? Could the loops be replaced with apply-family calls? I could look into doing a PR myself, but I'd like to hear your thoughts first.

@trinker @GitTFJ
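
To make the request concrete: a multi-pattern replacement helper usually calls gsub once per pattern, with each pass working on the output of the previous one. The sketch below is hypothetical. `multi_gsub` is an illustrative name, not sentimentr's .mygsub, whose body is not shown in this thread. It only shows where perl = TRUE could be forwarded. Because each pass feeds the next, the loop can be written as a `Reduce` fold, but not as independent `sapply` calls.

```r
# Hypothetical helper (NOT sentimentr's .mygsub): apply several regex
# replacements in sequence, forwarding extra arguments such as perl = TRUE
# to every gsub() call.
multi_gsub <- function(patterns, replacements, x, ...) {
    stopifnot(length(patterns) == length(replacements))
    Reduce(
        function(text, i) gsub(patterns[i], replacements[i], text, ...),
        seq_along(patterns),
        init = x
    )
}

# Collapse digit runs first, then squeeze repeated whitespace.
out <- multi_gsub(c("\\d+", "\\s+"), c("<num>", " "),
                  "room  12   was fine", perl = TRUE)
print(out)
```

Whether this is any faster than an explicit for loop is an open question. The fold only changes the style, and the time is still spent inside gsub.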

trinker commented 2 years ago

@AugustT What OS are you using?

Related: https://github.com/trinker/textclean/issues/51

trinker commented 2 years ago

I have made changes to address this @AugustT. Could you try it out and give feedback?

trinker commented 2 years ago

On Windows I get the following using this code, but I am wondering what Mac users will get:

gsub_reg <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_perl <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)

library(microbenchmark)
library(sentimentr)

y <- hotel_reviews$text

r <- microbenchmark::microbenchmark(
    gsub_reg = gsub_reg(y),
    gsub_perl = gsub_perl(y),
    times = 100
)

plot(r)
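
Since the plots are not reproduced here, the comparison can be rerun numerically. This is a minimal sketch under stated assumptions: it uses base R `system.time` instead of microbenchmark, a small made-up ASCII vector in place of hotel_reviews$text, and the [:alpha:] form of the pattern so that both regex engines accept it. `time_it` is a hypothetical helper name.

```r
# Stand-in data (an assumption): the thread benchmarks hotel_reviews$text.
y <- rep(c("Room 12 was ready at 3:30 pm; great stay!",
           "Checked out on day 2, no complaints."), 5000)

# [:alpha:] variant of the thread's pattern, valid with and without perl = TRUE.
pattern <- "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]"

# Hypothetical helper: elapsed seconds for `reps` repeated evaluations of f().
time_it <- function(f, reps = 5) {
    system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

timings <- c(
    default = time_it(function() gsub(pattern, "<<>>", y)),
    perl    = time_it(function() gsub(pattern, "<<>>", y, perl = TRUE))
)
print(timings)

# On this ASCII-only input both engines should make the same replacements.
same_output <- identical(gsub(pattern, "<<>>", y),
                         gsub(pattern, "<<>>", y, perl = TRUE))
print(same_output)
```

Timings will differ by machine and R build, so no numbers are given here.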

trinker commented 2 years ago

Similar results from Mac:

[Image: MicrosoftTeams-image]

trinker commented 2 years ago

I get similar results on Windows with \\p{L} swapped for [:alpha:]:

trinker commented 2 years ago

And with \\p{L} swapped for [:alpha:] on Mac:

[Image: MicrosoftTeams-image (1)]

AugustT commented 2 years ago

Great work. Not that it matters, but I was on Linux and Windows. I think your benchmarking negates the need for me to test. Great stuff! FYI @GitTFJ

trinker commented 2 years ago

One final related benchmark:

gsub_reg_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_reg_a <- function(x) gsub("\\d:\\d|\\d\\s|[^[:alpha:]',;: ]", '<<>>', x)
gsub_perl_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)

library(microbenchmark)
library(sentimentr)

y <- hotel_reviews$text

r <- microbenchmark::microbenchmark(
    gsub_reg_p = gsub_reg_p(y),
    gsub_reg_alpha = gsub_reg_a(y),
    gsub_perl_p = gsub_perl_p(y),
    times = 100
)

plot(r)

x <- c("Jeg hader dårlige mennesker.", "Jeg kan godt lide lækker is.",
    "Jeg elsker dig ikke mere; undskyld.")

Inside of make_sentence_df2 I do not use perl = TRUE (see https://github.com/trinker/sentimentr/commit/d24e6af7789c6185ac01b64cdcfcf39532917ea3) because in this case [:alpha:] is faster than \\p{L}, and setting perl = TRUE with [:alpha:] does not retain the non-ASCII alphabetic characters we want to keep:

gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x)
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."

gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x, perl = TRUE)
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."
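
The two calls above spell the class as ['alpha:], while the sentence before them refers to [:alpha:]. Below is a minimal sketch of the same comparison with the POSIX class written as [:alpha:], so it can be rerun locally. The Danish strings use \u escapes so they are UTF-8 regardless of the script's encoding. Results for the non-ASCII letters are deliberately not shown, because they can vary with locale and R build.

```r
# Same three sentences as above, with non-ASCII letters as \u escapes.
x <- c("Jeg hader d\u00e5rlige mennesker.",
       "Jeg kan godt lide l\u00e6kker is.",
       "Jeg elsker dig ikke mere; undskyld.")

# Pattern with the POSIX class spelled [:alpha:].
pattern_alpha <- "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]"

default_out <- gsub(pattern_alpha, "<<>>", x)
perl_out    <- gsub(pattern_alpha, "<<>>", x, perl = TRUE)

print(default_out)
print(perl_out)

# TRUE here means the two engines disagree on at least one string.
print(!identical(default_out, perl_out))
```

The third sentence is pure ASCII, so both engines should treat it the same way. Any difference should be confined to the strings containing å and æ.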