louismullie / stanford-core-nlp

Ruby bindings to the Stanford Core NLP tools (English, French, German).

Memory Leaks #16

Closed emmx closed 11 years ago

emmx commented 11 years ago

Hello,

I've found a significant memory leak somewhere in StanfordCoreNLP::Annotation. Check the memory consumption of this code (more than 600 MB as soon as you run it):

require 'stanford-core-nlp'

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)

loop do
  text = StanfordCoreNLP::Annotation.new('This is an example...')
end

Consider that since the variable "text" is reassigned on every iteration, there is no longer any reference to the previous object, so it should be reclaimed by the Ruby GC. However, memory usage keeps growing continuously.
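
As a sanity check, plain Ruby objects that lose their last reference do get reclaimed; here is a quick check using only the standard library (no JVM involved), as a baseline for comparison:

```ruby
# When the only reference to an object is reassigned, Ruby's GC
# reclaims it: total_freed_objects should grow after allocating
# in a loop and forcing a collection.
before = GC.stat(:total_freed_objects)

text = nil
100_000.times { text = 'This is an example...' * 100 }

GC.start
after = GC.stat(:total_freed_objects)
puts after > before
```

This prints `true` in plain Ruby, which is why the growth with Annotation objects looks like a leak.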

This matters because some of these scripts are supposed to run all day, but with a leak like this it's impossible to keep them running for even an hour!

Any workaround?

Thanks, Matt

louismullie commented 11 years ago

The line pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref) does a few things:

  1. Loads the JVM.
  2. Loads the JARs.
  3. Loads the models.

In this case, since you're loading all annotators, you're looking at ~200 MB of loaded files for steps 2-3 (and all of these are compressed files that are inflated when loaded into RAM). That explains your baseline memory usage. Consistent with this fairly large memory footprint, the default StanfordCoreNLP.jvm_args are set to ['-Xms512M', '-Xmx1024M'].
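
If those bounds don't suit your workload, they can be overridden before loading anything; jvm_args is a plain writable setting (the particular values below are only an example):

```ruby
require 'stanford-core-nlp'

# Override the JVM heap bounds before loading any annotators;
# these specific values are just an illustration.
StanfordCoreNLP.jvm_args = ['-Xms256M', '-Xmx2048M']
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)
```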

If the pipeline object were just a plain old Ruby object, I think you would be right to say it shouldn't be persisted. However, since it's in fact an Rjb::JavaProxy object, I'm not sure you can make any assumptions about GC here. Note that if you're going to do anything with the Annotation objects you are creating inside the loop, you'll need the pipeline - so there's no point in erasing it from memory.

The same principle applies to the loop: depending on how the Rjb GC works, you might end up accumulating live Annotation instances, continually increasing the memory footprint.
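
If forcing collections turns out to help, one mitigation is to trigger Ruby's GC every N iterations rather than on every pass. A pure-Ruby sketch of the pattern (a large plain string stands in for the Annotation here, since the real behavior depends on Rjb):

```ruby
# Periodic-GC pattern: hint the collector every GC_INTERVAL
# iterations to bound peak memory without paying for a full
# collection on each pass.
GC_INTERVAL = 100

1_000.times do |i|
  payload = 'This is an example...' * 5_000  # stand-in for Annotation.new
  payload.length                             # stand-in for processing it
  GC.start if ((i + 1) % GC_INTERVAL).zero?
end

puts GC.stat(:count) > 0  # at least one collection has run
```

Whether this actually releases the Java-side memory is exactly the Rjb question above.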

I don't know the ins and outs of Rjb's GC behavior, and you might want to ask @arton (the author of Rjb) for further details (provided you can reproduce the problem with a simple enough example). You may also want to look at JRuby.

Perhaps if you provide further details on what you are trying to accomplish, I can help further and propose an alternative approach.

emmx commented 11 years ago

Alright, thanks for your reply.

Just to clarify a little bit... it doesn't make much sense out of context, since I removed part of the code, but here is what the code looks like:

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma)
loop do
  # get some text to process
  text = get_some_text_from_db
  text = StanfordCoreNLP::Annotation.new(text)
  pipeline.annotate(text)
  # convert the text into a string of lemmas
  lemmas = []
  text.get(:sentences).each do |sentence|
    sentence.get(:tokens).each do |token|
      lemmas.push token.get(:lemma)
    end
  end
  lemmas = lemmas.join ' '
  # process all tags associated with the lemmas
  lemmas = StanfordCoreNLP::Annotation.new(lemmas)
  pipeline.annotate(lemmas)
  lemmas.get(:tokens).each do |token|
    process token.get(:part_of_speech)
  end
end

As you can see, there are two calls to StanfordCoreNLP::Annotation.new (I don't think it's possible to reuse the object I created previously, is it?). The idea is that I need to process all the part-of-speech tags of a string of singular/normalized words. For example, given the text "This is an example with a plural word: computers", I convert "computers" into "computer" and get the tag NN instead of NNS.

I couldn't find any simpler way to do this; maybe it's overkill (especially given the amount of memory it requires)...
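
For what it's worth, the two-pass shape itself is independent of the NLP backend. A toy pure-Ruby sketch (the LEMMAS table below is made up purely for illustration) shows the structure:

```ruby
# Toy stand-in for the lemma annotator: maps a few plural forms
# to their singular lemmas; everything else passes through.
LEMMAS = { 'computers' => 'computer', 'words' => 'word' }.freeze

def lemmatize(text)
  text.split.map { |w| LEMMAS.fetch(w.downcase, w) }.join(' ')
end

# Pass 1: normalize the text to lemmas.
normalized = lemmatize('an example with computers')

# Pass 2 in the real code re-annotates the lemma string and reads
# :part_of_speech; here we just print the normalized text.
puts normalized  # => "an example with computer"
```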

emmx commented 11 years ago

I found in the Stanford NLP documentation that there are some flags you can set to change the behavior of the algorithms slightly and improve performance. However, reading the code (particularly load and load_class), I concluded there's no way to create an annotator with custom properties such as the ones listed for this tokenizer: http://www.jarvana.com/jarvana/view/edu/stanford/nlp/stanford-corenlp/1.2.0/stanford-corenlp-1.2.0-javadoc.jar!/index.html?edu/stanford/nlp/process/PTBTokenizer.html

Am I wrong? Do you support passing properties to constructors?

louismullie commented 11 years ago

There is no support for that out of the box, but here's how it can be accomplished:

# Load the Java classes we need directly, bypassing the pipeline.
StanfordCoreNLP.load_class('StringReader', 'java.io')
StanfordCoreNLP.load_class('WordTokenFactory', 'edu.stanford.nlp.process')
StanfordCoreNLP.load_class('PTBTokenizer', 'edu.stanford.nlp.process')

# Tokenizer properties are passed as the third constructor argument.
options = 'ptb3Escaping=false'
reader = StanfordCoreNLP::StringReader.new('A string to tokenize.')
factory = StanfordCoreNLP::WordTokenFactory.new
tokenizer = StanfordCoreNLP::PTBTokenizer.new(reader, factory, options)

# Iterate over the resulting tokens.
while tokenizer.has_next
  token = tokenizer.next
  puts token.to_s
end

emmx commented 11 years ago

Alright, it's a little bit better now. Thank you!