mjockers / syuzhet

An R package for the extraction of sentiment and sentiment-based plot arcs from text

issues with get_nrc_sentiment #36

Open sutravekruttika opened 3 years ago

sutravekruttika commented 3 years ago

Hi, I am trying to perform sentiment analysis with the NRC lexicon on Twitter data; however, when I use get_nrc_sentiment it takes too long to compute. I do have a very large dataset.

How can I reduce the computation time? Please advise. Also, I am new to R. Thank you.

FelixPeckitt commented 3 years ago

Hi, thanks for raising your issue, and welcome to R! Do you have a code sample that shows the issue you are facing?

sutravekruttika commented 3 years ago

I am using the following code. I have about a million tweets.

```r
f_clean_tweets <- function(tweets) {
  clean_tweets = gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', tweets)
  clean_tweets = gsub('@\\w+', '', clean_tweets)
  clean_tweets = gsub('[[:punct:]]', '', clean_tweets)
  clean_tweets = gsub('[[:digit:]]', '', clean_tweets)
  clean_tweets = gsub('http\\w+', '', clean_tweets)
  clean_tweets = gsub('[ \t]{2,}', '', clean_tweets)
  clean_tweets = gsub('^\\s+|\\s+$', '', clean_tweets)
  clean_tweets = gsub('<.*>', '', enc2native(clean_tweets))
  clean_tweets = tolower(clean_tweets)
  clean_tweets
}

text_data = df_new$text
clean_tweets <- f_clean_tweets(text_data)
emotions <- get_nrc_sentiment(clean_tweets)
```

FelixPeckitt commented 3 years ago

Thanks for the code sample. So I’m assuming the cleansing is working fine, and it’s get_nrc_sentiment that is taking up most of the time - is that correct, and can you run the code on a subset of your million tweets?

Depending on what machine you are running your code on, you could partition the tweets into different groups, perhaps by starting letter or range of letters, then run this in parallel: https://www.r-bloggers.com/2017/10/running-r-code-in-parallel/. Alternatively, the simple approach if you are struggling to find the hardware would be to run one partition at a time, saving the results to a file or the workspace, then combining them afterwards. This would have the advantage of verifying that your code is running, but would require more effort on your part.
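To illustrate the parallel option, here is a minimal sketch assuming the `clean_tweets` vector from the code above and the base `parallel` package (this is not the only way to partition; `mclapply` relies on forking, so on Windows a `parLapply` cluster would be needed instead):

```r
library(syuzhet)
library(parallel)

# Split the tweets into one chunk per available core
n_cores    <- max(1, detectCores() - 1)
chunk_size <- ceiling(length(clean_tweets) / n_cores)
chunk_id   <- ceiling(seq_along(clean_tweets) / chunk_size)
chunks     <- split(clean_tweets, chunk_id)

# Score each chunk on its own core (forking; not available on Windows)
emotion_chunks <- mclapply(chunks, get_nrc_sentiment, mc.cores = n_cores)

# Recombine into one data frame, rows in the original tweet order
emotions <- do.call(rbind, emotion_chunks)
```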

Apart from running this code on a more powerful cloud instance, all I can suggest is leaving it to run overnight.

I hope this helps!

sutravekruttika commented 3 years ago

Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets). I got the results when I left it to run for a couple of hours. Looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.
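For anyone else reading, a minimal sketch of that chunk-at-a-time route might look like the following; it assumes the `clean_tweets` vector from the earlier code and an arbitrary chunk size, and saves each chunk's scores to disk so finished work is not lost if the session dies:

```r
library(syuzhet)

# Score the tweets in fixed-size chunks, saving each chunk's scores to a file
chunk_size <- 10000
starts <- seq(1, length(clean_tweets), by = chunk_size)

for (i in seq_along(starts)) {
  idx   <- starts[i]:min(starts[i] + chunk_size - 1, length(clean_tweets))
  piece <- get_nrc_sentiment(clean_tweets[idx])
  saveRDS(piece, sprintf("nrc_chunk_%04d.rds", i))
}

# Combine the saved chunks back into one data frame afterwards
files    <- sort(list.files(pattern = "^nrc_chunk_\\d+\\.rds$"))
emotions <- do.call(rbind, lapply(files, readRDS))
```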

FelixPeckitt commented 3 years ago

No problem. If you come up with something that helps, do post a snippet back so it can help others.

luisignaciomenendez commented 3 years ago

> Yes, the cleansing is fine. get_nrc_sentiment took hours to complete on a subset of my data (~200k tweets). I got the results when I left it to run for a couple of hours. Looks like I will just repeat this process on small chunks of data. Thank you for pointing me in the right direction.

Hello

I am facing the same issue here. My data consists of approximately 340,000 tweets, and I am trying to run get_nrc_sentiment on it. However, I timed the code to estimate the total run time (I left it overnight but it didn't finish), and it comes out to about 14 days in my case.

Since you mentioned in your comment that it took only a few hours, I wondered whether something is wrong, or whether someone has come up with a solution (and whether anyone has tried parallelisation successfully). Is it normal for it to take that long?

This is my current code: `emotions = get_nrc_sentiment(blm2$stripped_text)`, which works out to about 3.6 seconds per tweet.

Thanks in advance