stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

pipeline vs CoreNLPClient (start) vs CoreNLPClient(connecting to instance) #306

Closed malfonso0 closed 4 years ago

malfonso0 commented 4 years ago

Hi, and first of all, thanks in advance, and sorry if this is an old question or out of place. I actually have three questions. I'll post all three here, though maybe I should open separate threads.

CONTEXT: I'm developing a small system where I need to run an NLP process over a document. My architecture is based on AWS and is, more or less, as follows: an API receives a file and drops it in a staging area; "lots" of consumers are waiting (in order to get high throughput), and exactly one of them reads that file, processes it, and writes the result somewhere else. This is already working (with a local NLP server in each consumer). For the annotators I mostly use ['tokenize','pos','lemma','ner'], and for some documents ['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref','tokensregex', 'tokensregexnq', 'tickerregex']. I also run a lot of TokensRegex searches.

FIRST QUESTION: I'm quite new to this and don't yet understand the difference between using a stanza.Pipeline and a server.CoreNLPClient.

SECOND QUESTION: Each consumer is a different EC2 instance. I was wondering what the best approach here would be, and why.

THIRD QUESTION: If I go with the third option and want to minimize the EC2 instance requirements: installing stanza also installs pytorch (~800 MB), which seems unnecessary for this case. Is there a way to install a "serverless" stanza?

Thanks if anyone is able to clarify this for me. Sorry again if this is not the right place, or if these questions have already been answered.

AngledLuffa commented 4 years ago

Whether to use stanza or CoreNLP depends on your requirements. Stanza's models are generally more accurate, but also more expensive to run. For some tasks the model performance won't even be much different. For example, English tokenization is very well tuned and extremely fast in CoreNLP, and POS tagging is not significantly different, but dependency parsing and NER are better with stanza. Stanza also covers a much wider variety of languages. Other tools, such as constituency parsing and coref, only exist in CoreNLP.
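The trade-offs above can be sketched as a small decision helper. This is hypothetical code, not part of stanza; it just encodes the advice in this comment as simple rules:

```python
def choose_backend(annotators, large_corpus=False, has_gpu=False):
    """Suggest 'stanza' or 'corenlp' following the trade-offs described above.

    Heuristic only; the rules are an interpretation of this thread,
    not official guidance from the stanza maintainers.
    """
    # Annotators that (at the time of this thread) only exist in CoreNLP:
    # constituency parsing, coref, and the TokensRegex family.
    corenlp_only = {'parse', 'coref', 'tokensregex', 'tokensregexnq', 'tickerregex'}
    if corenlp_only & set(annotators):
        return 'corenlp'
    # The neural pipeline on a CPU is slow for large corpora.
    if large_corpus and not has_gpu:
        return 'corenlp'
    # Otherwise prefer the (generally more accurate) neural models.
    return 'stanza'
```

For example, a pipeline that needs coref always lands on CoreNLP, while a plain tokenize/POS/NER pipeline on a GPU machine lands on stanza.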

We can't advise on which of the three options is best for your setup. However, you do bring up an interesting point: our supported Python interface doesn't have to be part of stanza. We'll discuss this possibility internally. In the meantime, you can pass `--no-deps` when installing with pip so that pytorch is not pulled in. You may then need to manually install some of the other required dependencies, such as protobuf and six.
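Concretely, the install sequence would look something like this (pip's flag for skipping dependencies is `--no-deps`; which extra packages you need to add back varies by stanza version, so treat the second command as an assumption to verify against stanza's `setup.py`):

```shell
# Install stanza without pulling in its dependencies (notably pytorch, ~800 MB)
pip install --no-deps stanza

# Add back the lightweight dependencies the client code needs,
# e.g. protobuf and six as mentioned above; check stanza's setup.py
# for the full list on your version.
pip install protobuf six requests
```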

yuhui-zh15 commented 4 years ago

One point to mention: if you want to use the stanza neural pipeline to annotate large-scale corpora (e.g., more than 500 MB), it would be much better if you could find a GPU machine. Otherwise, we recommend using CoreNLP, as it will be much faster on a CPU.

malfonso0 commented 4 years ago

thanks both for your answers.

Right now, for both, I'm just calling the tokensregex method each time inside a for loop. For 1) I have read that I should concatenate the texts with a double \n to separate sentences, but I don't know if this is the best approach. For 2) I read that in Java there is a tokensregex.matcher.MultiMatch, but I could not find examples in Python.

any suggestions?

thanks again

AngledLuffa commented 4 years ago

For the same-regex, multiple-texts scenario: depending on how much text you're using, there is a startup cost for some of the processes involved. If you are making many queries, you will get better speed by combining them into one query (or at least by keeping the Java instance alive and sending multiple queries to the same server). Are you finding something that doesn't work when separating the sentences?
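One way to combine many documents into a single query is to join them with a blank line (which, with default ssplit settings, CoreNLP treats as a paragraph break) and remember the character offset where each document starts, so that matches reported with character offsets can be mapped back to their source document. This is a hypothetical helper sketch, not stanza API; whether your tokensregex results carry character offsets depends on the output format you request from the server:

```python
from bisect import bisect_right


def combine_docs(docs, sep="\n\n"):
    """Join documents with a blank line and record where each one starts.

    Returns (combined_text, offsets), where offsets[i] is the character
    offset of docs[i] within combined_text.
    """
    offsets, pos = [], 0
    for doc in docs:
        offsets.append(pos)
        pos += len(doc) + len(sep)
    return sep.join(docs), offsets


def doc_for_offset(offsets, char_offset):
    """Map a character offset in the combined text back to a document index."""
    return bisect_right(offsets, char_offset) - 1
```

With this, you can send one big annotate/tokensregex request for the combined text and attribute each match to the original document by its offset, instead of paying the per-query cost in a for loop.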

malfonso0 commented 4 years ago

It worked when separating the sentences, but I just wanted to know if there is something better. Thanks!