stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

What are the biggest performance bottlenecks? #1390

Open sambaPython24 opened 2 months ago

sambaPython24 commented 2 months ago

Hey, I noticed that for very large amounts of text data, the pipeline takes a long time to finish. We probably cannot simplify the PyTorch models (or can we?), but maybe the authors could write a list of the most time-consuming operations that could be improved.

This would help to support your efforts with stanza.

AngledLuffa commented 2 months ago

The constituency parser is probably the slowest, so if you don't need constituency parses, you could consider dropping it.

The tokenizer spends a surprising amount of CPU time building all the token objects. We have in fact talked about pushing some of that into C++ or Rust.
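A quick way to see where that CPU time goes is to run the pipeline under Python's built-in cProfile. The sketch below profiles a stand-in workload (many small per-token object allocations) rather than stanza itself, since running the real pipeline requires downloaded models; to profile stanza, wrap the `nlp(text)` call in the profiler instead. The function names here are invented for illustration.

```python
import cProfile
import io
import pstats

def build_tokens(text):
    # Stand-in for the tokenizer's per-token object construction:
    # each word becomes a small dict, mimicking many tiny allocations.
    return [{"text": w, "index": i} for i, w in enumerate(text.split())]

def profile_workload():
    text = "the quick brown fox jumps over the lazy dog " * 2000
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(20):
        build_tokens(text)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(5)  # show the 5 most expensive entries
    return stream.getvalue()

report = profile_workload()
print(report)
```

The `cumulative` sort surfaces the functions that dominate wall-clock time, which is usually enough to decide what is worth porting to a compiled language.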

sambaPython24 commented 2 months ago
import time
import stanza

text = "The United Nations is a diplomatic and political international organization whose stated purposes are to maintain international peace and security, develop friendly relations among nations, achieve international cooperation, and serve as a centre for harmonizing the actions of nations. It is the world's largest international organization. The UN is headquartered in New York City (in the United States, but with certain extraterritorial privileges), and the UN has other offices in Geneva, Nairobi, Vienna, and The Hague, where the International Court of Justice is headquartered at the Peace Palace."

processors = [
    'tokenize',
    'tokenize,pos',
    'tokenize,pos,constituency',
    'tokenize,mwt',
    'tokenize,mwt,pos',
    'tokenize,mwt,pos,lemma',
    'tokenize,mwt,pos,lemma,depparse',
]

res = {}
for proc in processors:
    # pipeline construction is deliberately excluded from the timing
    nlp = stanza.Pipeline(lang='en', processors=proc)
    start = time.time()
    doc = nlp(text)
    res[proc] = time.time() - start

print(res)

You are right, the time doubles.

The community could certainly help port these to C++, if you could start listing the specific functions that would be most useful to move to C++ (CUDA).

AngledLuffa commented 2 months ago

It's not the CUDA usage that's the problem. It's the object creation in the tokenizer, and the code that determines whether or not a transition is legal in the parser.
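To make the object-creation cost concrete, here is an illustrative microbenchmark (not stanza's code): constructing many small Python objects is expensive, and even a pure-Python mitigation like `__slots__` measurably changes the cost, which is why moving this layer to a compiled language is attractive.

```python
import timeit

class Token:
    """Ordinary class: each instance carries a per-instance __dict__."""
    def __init__(self, text, start, end):
        self.text = text
        self.start = start
        self.end = end

class SlotToken:
    """Same fields, but __slots__ avoids the per-instance __dict__."""
    __slots__ = ("text", "start", "end")
    def __init__(self, text, start, end):
        self.text = text
        self.start = start
        self.end = end

def make(cls, n=10000):
    # Simulates a tokenizer building one object per token.
    return [cls("word", i, i + 4) for i in range(n)]

t_plain = timeit.timeit(lambda: make(Token), number=20)
t_slots = timeit.timeit(lambda: make(SlotToken), number=20)
print(f"plain: {t_plain:.3f}s  slots: {t_slots:.3f}s")
```

On most CPython builds the slotted version is noticeably faster and smaller in memory; a C++ or Rust struct would reduce the per-token cost further still.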

sambaPython24 commented 2 months ago

Could you point to a specific file or function?

AngledLuffa commented 2 months ago

Sure, I think the tokenizer is mostly slower than it could be because of decode_predictions:

https://github.com/stanfordnlp/stanza/blob/6e442a6199f7e466c57c02de8d2f9d516bdd5715/stanza/models/tokenization/utils.py#L463

sambaPython24 commented 1 month ago

It is a bit difficult to get into a system from the outside. A very helpful step would be to annotate the different functions in your package, e.g.

from typing import Dict, List, Mapping, Optional, Tuple, TypeVar, Union
...
def add(a: int, b: int) -> int:
    return a + b

When I look at the code, I sometimes see that an argument has a default value of None, and one would have to print the argument's type at runtime to infer the intended type.

P.S.: The annotations are not enforced at runtime in Python; they are just for readability and for when you want to move to a static language.
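The None-default case mentioned above can be made explicit with `Optional`. A minimal sketch, with an invented function (not stanza's API), showing how the annotation documents that an argument is either a mapping or None:

```python
from typing import Dict, List, Optional

def expand_tokens(sentence: str,
                  mwt_dict: Optional[Dict[str, str]] = None) -> List[str]:
    """Split a sentence into words, expanding multi-word tokens
    via mwt_dict when one is provided (hypothetical example)."""
    words = sentence.split()
    if mwt_dict is None:
        return words
    return [mwt_dict.get(w, w) for w in words]

print(expand_tokens("can't stop"))
print(expand_tokens("can't stop", {"can't": "can not"}))
```

The `Optional[Dict[str, str]]` annotation states, without any runtime check, exactly the fact a translator to a static language would need: the argument is a str-to-str mapping or None.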

AngledLuffa commented 1 month ago

You're not wrong, but, it's also not the limitation stopping us or outside folks from making a faster tokenizer. I'm pretty sure the right answer is to make all of the random little python objects in a compiled language instead.

sambaPython24 commented 1 month ago

Yes, but for a translation to a statically typed language they would be necessary and a great help. If one were to start translating the decode_predictions function, it would help a lot to know that e.g. vocab is an instance of the Vocab class in tokenization/vocab.py, by writing decode_predictions(vocab: Vocab, mwt_dict: Dict[str, str], ...).

When translating process_sentence(sentence, mwt_dict=None), the type of sentence is undeclared, and I do not know any other way than to print its type at runtime (with mwt_dict it is even more difficult, since it is mostly None and only in some cases a Dict[str, str] (?)).

It is probably easier for somebody with a general overview of the package to write the annotations.
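One low-effort way to recover those types without reading every call site is to record the concrete argument types at runtime. A generic sketch (the decorated function is a stand-in, not stanza's actual process_sentence):

```python
import functools

def log_arg_types(func):
    """Record the concrete types a function is called with,
    so annotations can be written from observed usage."""
    seen = []

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        seen.append(tuple(type(a).__name__ for a in args))
        return func(*args, **kwargs)

    wrapper.seen_types = seen
    return wrapper

@log_arg_types
def process(sentence, mwt_dict=None):
    # stand-in body; only the recorded call signatures matter here
    return sentence

process("hello world", None)
process("hello world", {"can't": "can not"})
print(process.seen_types)  # [('str', 'NoneType'), ('str', 'dict')]
```

Running a test suite with such a decorator applied to the functions of interest yields the observed type combinations, from which `Optional[...]` and `Union[...]` annotations can be written.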


After a deeper analysis of the code, I think that the first steps towards a C++ implementation could be: