emmx closed this issue 11 years ago
The line

```ruby
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
```

does a few things:
In this case, since you're loading all annotators, you're looking at roughly 200 MB of loaded model files for steps 2-3 (all of them compressed files that are inflated when loaded into RAM). That explains your baseline memory usage. Consistent with this fairly large memory footprint, the default `StanfordCoreNLP.jvm_args` are set to `['-Xms512M', '-Xmx1024M']`.
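If those defaults don't fit your machine, the gem lets you override them; this has to happen before the first `load` call, since the JVM is only started once. A configuration sketch (the heap sizes below are illustrative values, not recommendations):

```ruby
require 'stanford-core-nlp'

# Override the JVM flags before the pipeline (and thus the JVM) is loaded.
# Size these to your own workload; they are placeholders here.
StanfordCoreNLP.jvm_args = ['-Xms256M', '-Xmx2048M']

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)
```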
If the pipeline object were just a plain old Ruby object, I think you would be right to say it shouldn't be persisted. However, since it's in fact an `Rjb::JavaProxy` object, I'm not sure you can make any assumptions about GC here. Note that if you're going to do anything with the `Annotation` objects you create inside the loop, you'll need the pipeline, so there's no point in erasing it from memory.
The same principle applies to the loop: depending on how the Rjb GC works, you might end up accumulating live `Annotation` instances, continually increasing the memory footprint.
I don't know the ins and outs of Rjb's GC, and you may want to ask @arton (the author of Rjb) for further details (provided you can reproduce the problem with a simple enough example). You may also want to look at JRuby.
Perhaps if you provide more details on what you are trying to accomplish, I can help further and suggest an alternative approach.
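One thing worth trying in the meantime is dropping the Ruby-side references explicitly and nudging the collector at intervals. Whether that actually releases the underlying Java objects depends on Rjb's proxy lifecycle, so treat this as a sketch rather than a guaranteed fix (`get_some_text_from_db` stands in for the DB call from the snippet below):

```ruby
i = 0
loop do
  text = StanfordCoreNLP::Annotation.new(get_some_text_from_db)
  # ... annotate and process `text` here ...
  text = nil                     # drop the Ruby-side reference
  i += 1
  GC.start if (i % 100).zero?    # periodically ask MRI for a full collection
end
```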
Alright, thanks for your reply.
Just to clarify a little: it doesn't make much sense on its own since I removed part of the code, but here is what the code looks like:
```ruby
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma)

loop do
  # get some text to process
  text = get_some_text_from_db
  text = StanfordCoreNLP::Annotation.new(text)

  # convert text into a string of lemmas
  lemmas = []
  text.get(:sentences).each do |sentence|
    sentence.get(:tokens).each do |token|
      lemmas.push token.get(:lemma)
    end
  end
  lemmas = lemmas.join ' '

  # process all tags associated with the lemmas
  lemmas = StanfordCoreNLP::Annotation.new(lemmas)
  lemmas.get(:tokens).each do |token|
    process token.get(:part_of_speech)
  end
end
```
As you can see there are two `StanfordCoreNLP::Annotation.new` calls (I think it's not possible to reuse the object I created previously, is it?). The idea is that I need to process all part-of-speech tags of a string containing singular/normalized words. For example, given the text "This is an example with a plural word: computers", I convert "computers" into "computer" and get the tag NN instead of NNS.
I couldn't find any simpler way to do this; maybe it's overkill (especially given the amount of memory it requires)...
I found in the Stanford NLP documentation that there are some flags you can set to tweak the behavior of the algorithms and improve performance. However, reading the code (particularly `load` and `load_class`), I concluded there's no way to create an annotator with custom properties such as the ones listed for this tokenizer: http://www.jarvana.com/jarvana/view/edu/stanford/nlp/stanford-corenlp/1.2.0/stanford-corenlp-1.2.0-javadoc.jar!/index.html?edu/stanford/nlp/process/PTBTokenizer.html
Am I wrong? Do you support passing properties to constructors?
There is no support for that out of the box, but here's how it can be accomplished:
```ruby
StanfordCoreNLP.load_class('StringReader', 'java.io')
StanfordCoreNLP.load_class('WordTokenFactory', 'edu.stanford.nlp.process')
StanfordCoreNLP.load_class('PTBTokenizer', 'edu.stanford.nlp.process')

options = 'ptb3Escaping=false'
reader = StanfordCoreNLP::StringReader.new('A string to tokenize.')
factory = StanfordCoreNLP::WordTokenFactory.new
tokenizer = StanfordCoreNLP::PTBTokenizer.new(reader, factory, options)

while tokenizer.has_next
  token = tokenizer.next
  puts token.to_s
end
```
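As a usage note, the options argument is a plain comma-separated string, so several of the tokenizer settings from the javadoc can be combined in one constructor call (the option names below are taken from the `PTBTokenizer` documentation; double-check them against your CoreNLP version):

```ruby
# Combine several tokenizer options in one comma-separated string.
options   = 'ptb3Escaping=false,americanize=false'
reader    = StanfordCoreNLP::StringReader.new('A string to tokenize.')
factory   = StanfordCoreNLP::WordTokenFactory.new
tokenizer = StanfordCoreNLP::PTBTokenizer.new(reader, factory, options)
```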
Alright, it's a little bit better now. Thank you!
Hello,
I've found a significant memory leak somewhere in `StanfordCoreNLP::Annotation`. Check the memory consumption of this code (more than 600 MB as soon as you execute it):
Consider that since you're reusing the variable `text`, you no longer have a reference to the previous object, so it should be removed by the Ruby GC. However, memory usage keeps increasing.
This matters because some of my scripts are supposed to run all day, but with problems like this one it's impossible to keep them running for even an hour!
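For plain Ruby objects that reasoning is correct, and it's easy to check with `WeakRef` from the standard library; the open question is whether `Rjb::JavaProxy` objects follow the same lifecycle. A pure-Ruby sketch, unrelated to the gem itself:

```ruby
require 'weakref'

# Build an object whose only strong reference lives inside this method,
# so the object is unreachable once the method returns.
def watched_object
  obj = 'some text ' * 10_000
  WeakRef.new(obj)
end

ref = watched_object
GC.start
# After a full collection the weak reference is usually dead:
puts ref.weakref_alive? ? 'still alive' : 'collected'
```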
Any workaround?
Thanks, Matt