stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

MemoryError after upgrade to stanza #293

Closed mehmetilker closed 4 years ago

mehmetilker commented 4 years ago

Describe the bug After replacing stanfordnlp with stanza I am experiencing an increase in disk usage & memory. Additionally, CPU usage looks more stable.

Expected behavior Since I only swapped the old library for the new one, with a PyTorch upgrade (1.4 > 1.5), I expected little or no change.

Environment (please complete the following information):

Additional context There is a service continuously parsing some text, and after some time it throws an exception. I am using stanza with spacy_stanza (previously spacy_stanfordnlp); when I increase the batch size (pipe), I experience the problem more often.
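For context, a minimal sketch of this kind of setup (not the actual service code); the language code, processor list, sample texts, and batch_size are assumptions, and the StanzaLanguage wrapper reflects the spacy-stanza 0.2.x API of the time:

```python
import stanza
from spacy_stanza import StanzaLanguage  # spacy-stanza 0.2.x API

# Build a stanza pipeline and wrap it so spaCy's nlp.pipe() batching applies.
snlp = stanza.Pipeline(lang="tr", processors="tokenize,mwt,pos,lemma,depparse")
nlp = StanzaLanguage(snlp)

texts = ["Bir örnek cümle.", "Başka bir örnek cümle."]  # placeholder inputs
# A larger batch_size means more documents held in memory per call.
for doc in nlp.pipe(texts, batch_size=32):
    for token in doc:
        print(token.text, token.pos_, token.dep_)
```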

Pretrained file exists but cannot be loaded from /home/user/stanza_resources/tr/pretrain/imst.pt, due to the following exception:

11:15:12.065 -  ERROR - run_jobs.py               - run_jobs_in_order - Unexpected error: Traceback (most recent call last):
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/common/pretrain.py", line 45, in load
    data = torch.load(self.filename, lambda storage, loc: storage)
  File "/home/proj_home/.env/lib/python3.8/site-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/proj_home/.env/lib/python3.8/site-packages/torch/serialization.py", line 773, in _legacy_load
    result = unpickler.load()
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_jobs.py", line 116, in run_jobs_in_order
    parse_content(batch_id)
  File "run_jobs.py", line 31, in parse_content
    all(batch_id)
  File "/home/proj_home/common/PerfUtils.py", line 56, in _wrapper
    result = f(*args, **kwargs)
  File "/home/proj_home/runparse.py", line 165, in all
    nlp = CustomParser.loadParser(includeDepParse=True)
  File "/home/proj_home/common/PerfUtils.py", line 56, in _wrapper
    result = f(*args, **kwargs)
  File "/home/proj_home/services/parse/api/CustomParserStanza.py", line 48, in loadParser
    snlp = stanza.Pipeline(**config)
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/pipeline/core.py", line 121, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/pipeline/processor.py", line 103, in __init__
    self._set_up_model(config, use_gpu)
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/pipeline/pos_processor.py", line 25, in _set_up_model
    self._trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/pos/trainer.py", line 35, in __init__
    self.load(model_file, pretrain)
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/pos/trainer.py", line 118, in load
    emb_matrix = pretrain.emb
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/common/pretrain.py", line 39, in emb
    self._vocab, self._emb = self.load()
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/common/pretrain.py", line 50, in load
    return self.read_pretrain()
  File "/home/proj_home/.env/lib/python3.8/site-packages/stanza/models/common/pretrain.py", line 58, in read_pretrain
    raise Exception("Vector file is not provided.")
Exception: Vector file is not provided.

You can see the changes at the red line: [image]

yuhui-zh15 commented 4 years ago

@mehmetilker Is NER included in your Stanza pipeline? If so, it is not a fair comparison with stanfordnlp, as NER is a new feature in Stanza. While our NER model achieves SOTA results, it relies on contextualized word embeddings generated by a character-level RNN, which requires significant computational resources and favors a GPU.

Can you disable the NER processor and compare it with stanfordnlp again? Thanks!

mehmetilker commented 4 years ago

@yuhui-zh15 There is no NER model for the Turkish language, and I am using the 'tokenize,mwt,pos,lemma,depparse' processors, so I guess NER is already disabled.
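For reference, a sketch of such a pipeline with NER left out (the processor list matches the comment above; the sample input is a placeholder):

```python
import stanza

# Only these five processors are requested, so no NER model is loaded.
nlp = stanza.Pipeline(lang="tr", processors="tokenize,mwt,pos,lemma,depparse")

doc = nlp("Bir örnek cümle.")  # placeholder input
word = doc.sentences[0].words[0]
print(word.text, word.upos, word.lemma, word.deprel)
```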

yuhui-zh15 commented 4 years ago

@mehmetilker Can you provide the script you used? That way, we can understand the problem more quickly!

mehmetilker commented 4 years ago

@yuhui-zh15 My mistake. I have found the reason for the disk I/O problem; it is nothing related to stanza. The memory increase is still there. I will try to reproduce it with a sample. Until then I am closing the issue.

DesiPilla commented 3 years ago

> @yuhui-zh15 My mistake. I have found the reason for the disk I/O problem; it is nothing related to stanza. The memory increase is still there. I will try to reproduce it with a sample. Until then I am closing the issue.

Can you share the solution you found? I am experiencing the same issue.

qipeng commented 3 years ago

@mehmetilker @DesiPilla In general, if you're seeing memory errors with stack traces that look like model loading, it probably means your memory is too small to load all of the Stanza models you need at once. If you're in a VM or Docker environment, increasing the memory limit should help; otherwise, you can also try to process the text one step at a time: Stanza processors have flags like tokenize_pretokenized and depparse_pretagged that take the output from previous stages without recomputing it. See the documentation for processors for more details!
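A hedged sketch of that staged approach (the language, processor split, and sample text are assumptions, not from this thread): run the lighter stages first, free their models, then load the parser alone with depparse_pretagged=True so it reuses the tags already attached to the Document:

```python
import stanza

# Stage 1: tokenize and tag only; this avoids holding the parser in memory.
nlp_tag = stanza.Pipeline(lang="tr", processors="tokenize,mwt,pos,lemma")
doc = nlp_tag("Bir örnek cümle.")  # placeholder input
del nlp_tag  # release the stage-1 models before loading the parser

# Stage 2: parse the pretagged Document without recomputing earlier stages.
nlp_parse = stanza.Pipeline(lang="tr", processors="depparse",
                            depparse_pretagged=True)
doc = nlp_parse(doc)
print(doc.sentences[0].words[0].deprel)
```

Similarly, tokenize_pretokenized=True lets the pipeline accept pre-split input instead of running the neural tokenizer.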