trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

High memory consumption when using sentiment() function #39

Open contefranz opened 7 years ago

contefranz commented 7 years ago

I am running some polarity computations through the function sentiment(). What I am experiencing is, even for small pieces of text, a huge amount of allocated RAM. Sometimes I also get the following error:

Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", :
  negative length vectors are not allowed
Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table
Execution halted

A character vector of 669 kB (computed through object_size() from the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This is causing some problems, as you can imagine, when texts get longer.

I know you have developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.

Do you have any hints, or are you aware of this issue? I am not including a minimal example since this analysis can easily be reproduced through the profiling tool in RStudio.

Thanks

trinker commented 7 years ago

Can you make both parts of this reproducible? The stringi package has tools to generate random text that you can use to mimic the data you're talking about.

contefranz commented 7 years ago

Thank you for the hint. Below you can find the minimal example.

# minimal example
rm( list = ls() )
gc( reset = TRUE )

library( pryr )
library( stringi )
library( data.table )
library( sentimentr )

# generate some paragraphs of random text and flatten them into a single string
set.seed( 2017 )
text = stri_flatten( stri_rand_lipsum( 50000 ), " " )

object_size( text )
object.size( text )

# computing tone
tone = sentiment( text )

object_size( tone )
object.size( tone )

The profiler, run through profvis::profvis(), says that memory went up to 4.153 GB despite an initial object (text) of just 6 MB. Unfortunately, I can't upload the screenshot. Could you please run this and see what is happening? My problem is even worse since some texts are above 50 MB, and when I compute the tone the RAM usage can reach 600 GB. This forces the job to be killed right away, even though the workstation is really powerful.

Below you can find my session info. Thank you again.

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sentimentr_1.0.0  data.table_1.10.4 stringi_1.1.5     pryr_0.1.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10     codetools_0.2-15 digest_0.6.12    jsonlite_1.4     magrittr_1.5     syuzhet_1.0.1    textclean_0.3.1  tools_3.4.0     
 [9] stringr_1.2.0    htmlwidgets_0.8  yaml_2.1.14      compiler_3.4.0   lexicon_0.3.1    htmltools_0.3.6  profvis_0.3.3   
trinker commented 7 years ago

sentimentr works at the sentence level, so in the example you provide, the split into sentences produces ~500K sentences. This runs for me but certainly consumes a fair bit of memory. There may be ways to improve sentimentr's memory consumption, but I have not found one. If someone sees this and sees a way to make sentimentr more memory efficient, a PR is welcomed. I used data.table for speed reasons, not memory, and I'm guessing there are ways to improve my code in this respect.
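
To get a feel for the scale of that split, a quick check on the minimal example above (get_sentences() is the sentence splitter sentimentr exposes):

library( sentimentr )
library( stringi )

set.seed( 2017 )
text = stri_flatten( stri_rand_lipsum( 50000 ), " " )

# split into sentences before any scoring happens
sentences = get_sentences( text )
length( unlist( sentences ) )  # roughly half a million sentences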

Until then, my suggestion is to chunk the text and loop through it with a manual memory release (gc()) after each iteration.
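
For example, a rough sketch of that chunk-and-release pattern (the chunk size of 1,000 paragraphs is arbitrary, and the lipsum input just mirrors the minimal example above):

# sketch: score the text in chunks and release memory after each pass
library( stringi )
library( sentimentr )
library( data.table )

set.seed( 2017 )
paragraphs = stri_rand_lipsum( 50000 )
chunks = split( paragraphs, ceiling( seq_along( paragraphs ) / 1000 ) )

results = vector( "list", length( chunks ) )
for ( i in seq_along( chunks ) ) {
  results[[ i ]] = sentiment( stri_flatten( chunks[[ i ]], " " ) )
  gc()  # manual memory release after each iteration
}
tone = rbindlist( results, idcol = "chunk" )

Note that the element ids restart within each chunk, hence the idcol to keep track of which chunk a sentence came from.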

My second thought is that perhaps sentimentr isn't the right tool for this job. I state in the README that the tool is designed to balance the trade-off between accuracy and speed. I don't address memory, but if you're chugging through that much text you're going to have to balance your own trade-offs.

I evaluate a number of sentiment tools in the package README. One tool I evaluate is Drew Schmidt's meanr (https://github.com/wrathematics/meanr). It is written in low-level C and is very fast, and it should be memory efficient as well. His work is excellent and specifically targeted at the type of analysis you seem to be doing, so it might be the better choice. Both of our packages have READMEs that explain the package philosophies/goals very well. I suggest starting there and asking whether you care about the added accuracy of sentimentr enough to chunk your text and loop through it. If not, it's not the tool for this task.
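
For comparison, a minimal sketch of scoring the same flattened text with meanr (this assumes the package is installed from the link above and that its score() function is the entry point described in its README):

library( stringi )
library( meanr )

set.seed( 2017 )
text = stri_flatten( stri_rand_lipsum( 50000 ), " " )

# document-level positive/negative counts and an overall score
score( text )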

That being said, I want to leave this issue open; if any community members want to look through the code and optimize memory usage, the improvement would be welcomed.

contefranz commented 7 years ago

Thank you for the valuable answer. You provide good reasons, so I'll check out the meanr package as you suggest. What puzzles me, though, is that data.table was conceived not only for speed but also for memory efficiency. Its "by-reference" paradigm aims specifically at minimising the internal copies that are so common in R.

Anyway, I suspect you are right. My texts contain many sentences, sometimes even more than they should because of HTML tagging and other fancy stuff.

I will try to keep this updated, and when I have time it could be worthwhile to take a look at the internals of sentimentr.

Thank you again.

trinker commented 7 years ago

What puzzles me, though, is that data.table was conceived not only for speed but also for memory efficiency. Its "by-reference" paradigm aims specifically at minimising the internal copies that are so common in R.

I suspect a true data.table whizz would see how to optimize this (@mattdowle would likely feel sick if he saw how I've used data.table). So I'm saying let's assume the issue is my misuse of data.table, not data.table itself.

contefranz commented 7 years ago

data.table works like magic. No doubt about this, full stop. The only suggestion I can give you is to carefully profile your function, which I saw exploits many other internal functions. For someone who did not develop the code it is hard to spot the issues, but I think for you it should be much easier.

MarcoDVisser commented 7 years ago

Referred here from another forum by Trinker.

Profiling should show you what is consuming the most memory; here is a quick guide: https://github.com/MarcoDVisser/aprof#memory-statisics

[This is on condition that you aren't working in a lower-level language].
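
For instance, a minimal sketch of the base-R route to the same information (aprof, linked above, parses the same Rprof output; the smaller sample size here is just to keep the profiling run manageable):

library( sentimentr )
library( stringi )

set.seed( 2017 )
text = stri_flatten( stri_rand_lipsum( 5000 ), " " )

Rprof( "sentiment.out", memory.profiling = TRUE, line.profiling = TRUE )
tone = sentiment( text )
Rprof( NULL )

# timing summary with memory use reported alongside
summaryRprof( "sentiment.out", memory = "both", lines = "show" )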

I'll be happy to help think about what is causing the "high consumption".

M

trinker commented 7 years ago

Not surprisingly...the comma_reducer is causing huge memory use.

Per Marco's aprof:

[screenshot of aprof output showing comma_reducer's memory use]

MarcoDVisser commented 7 years ago

Hi trinker,

Looking at https://github.com/trinker/sentimentr/blob/master/R/utils.R

I see a bunch of potential problems (e.g. the potential use of non-vectorized ifelse statements), which may in fact not be problems at all. It all depends on how these functions are used and how they are "fed" data. Hence, we would need more detailed profiling.
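
As a generic illustration of the vectorisation point (a toy comparison, not code taken from sentimentr):

x = runif( 1e6 )

# per-element if/else applied one value at a time
flag_loop = vapply( x, function( xi ) if ( xi > 0.5 ) 1L else 0L, integer( 1 ) )

# single vectorised comparison, far less call overhead
flag_vec = as.integer( x > 0.5 )

identical( flag_loop, flag_vec )  # TRUE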

Would you mind running the targetedSummary function on line 262? https://www.rdocumentation.org/packages/aprof/versions/0.3.2/topics/targetedSummary

As you appear to use data.table, I'll be interested to see which functions are consuming so much memory.
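
A rough sketch of what that could look like; this is heavily hedged: "run_sentiment.R" is a hypothetical driver script, the utils.R path must point at a local copy of sentimentr's source, and the argument names follow the targetedSummary() documentation linked above:

library( aprof )

Rprof( "mem.out", memory.profiling = TRUE, line.profiling = TRUE )
source( "run_sentiment.R" )  # hypothetical script that calls sentiment()
Rprof( NULL )

ap = aprof( "R/utils.R", "mem.out" )  # local copy of sentimentr's utils.R
targetedSummary( 262, ap, findMemory = TRUE )  # line 262, as requested above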

M.

trinker commented 7 years ago

https://github.com/trinker/sentimentr/issues/46 may reduce some memory consumption: