stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Batch Sizes not used anywhere? Out of mem... #1327

Closed andrePankraz closed 1 month ago

andrePankraz commented 8 months ago

Describe the bug I am getting out-of-memory errors with processes of 35 GB; stanza could be tracked down as the reason.

To Reproduce Steps to reproduce the behavior:

  1. Take e.g. stanza.MultilingualPipeline():

     self.nlp = stanza.MultilingualPipeline(
         model_dir=f"{get_from_env('model_dir', 'MODELS_FOLDER', 'data/models/')}stanza",
         lang_id_config={
             "langid_clean_text": True,
             "langid_lang_subset": ["de", "en"],
         },
         lang_configs={
             "de": {"processors": "tokenize,mwt", "verbose": False},
             "en": {"processors": "tokenize", "verbose": False},
         },
         use_gpu=False,
     )
  2. Call self.nlp(lines) with several thousand lines.
  3. The LangId processor clusters the lines by length, creates a tensor, and calls the LSTM. If one cluster happens to be a few hundred lines long (and each line has some complexity), we get the described out-of-memory error. A standalone sketch follows this list.
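
To make the steps concrete, here is a standalone version (the sample sentences are invented; the point is many lines of similar length, and model_dir is omitted so the default model location is used):

import stanza

nlp = stanza.MultilingualPipeline(
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
    use_gpu=False,
)

# Thousands of lines of similar length: LangId clusters them by length,
# so hundreds of them can land in a single LSTM batch.
lines = ["Dies ist ein etwas längerer Beispielsatz mit der Nummer %d." % i
         for i in range(5000)]
docs = nlp(lines)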

Expected behavior No out of memory ;) For instance, by really using batching in the pipelines?!

The classes accept some batch-related init params but don't seem to do anything with them (or I cannot see it). E.g. MultilingualPipeline.__init__ has a param ld_batch_size=64, which isn't used anywhere in this class (e.g. for initializing sub-processors). The processor LangIDBiLSTM also has self.batch_size = batch_size with a default of 64, but again it doesn't seem to be used anywhere.

Do I have wrong expectations? OK, I can batch the input myself, but that doesn't seem to be the intention of this wrapper (and it shouldn't be), or I could call the LSTM directly without all this wrapper stuff.
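
For reference, a minimal sketch of the manual-batching workaround (process_in_batches and the chunk size of 64 are my own, not part of the stanza API; nlp and lines are as in the sketch above):

def process_in_batches(pipe, lines, batch_size=64):
    # Feed the pipeline fixed-size chunks so no single length cluster can
    # grow beyond batch_size lines, bounding peak memory.
    docs = []
    for start in range(0, len(lines), batch_size):
        docs.extend(pipe(lines[start:start + batch_size]))
    return docs

docs = process_in_batches(nlp, lines)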

AngledLuffa commented 7 months ago

Would you provide the complete stack trace please?

AngledLuffa commented 7 months ago

Ultimately I would like to be able to recreate the problem, but the following doesn't OOM on a 3090 and comes nowhere near using up all my RAM:

import stanza

pipe = stanza.MultilingualPipeline(lang_id_config={ "langid_clean_text": True,
                                                    "langid_lang_subset": ["de", "en"] },
                                   lang_configs={ "de": {"processors": "tokenize,mwt", "verbose": False},
                                                  "en": {"processors": "tokenize", "verbose": False}})

text = "\n\n".join("This is a sample text %d" % i for i in range(10000))
# discarding the result each time
result = pipe(text)

text = "\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = "   ".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

andrePankraz commented 1 month ago

Couldn't reproduce it either, closing. Thx