stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

pipeline vs CoreNLPClient (start) vs CoreNLPClient(connecting to instance) #306

Closed malfonso0 closed 4 years ago

malfonso0 commented 4 years ago

Hi, and first of all, thanks in advance, and sorry if this is an old question or out of place. I actually have three questions. I'll post all three here, though maybe I should open separate threads.

CONTEXT: I'm developing a small system where I need to run an NLP process over a document. My architecture is based on AWS and is, more or less, as follows: an API receives a file and drops it in a staging area; "lots" of consumers are waiting (in order to get high throughput), and exactly one of them reads that file, processes it, and writes the result somewhere else. This is already working (with a local NLP server in each consumer). For the annotators I mostly use ['tokenize','pos','lemma','ner'], and for some documents ['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref','tokensregex', 'tokensregexnq', 'tickerregex']. I also run a lot of TokensRegex searches.

FIRST QUESTION: I'm quite new to this and don't yet understand the difference between using a stanza.Pipeline and a server.CoreNLPClient.

SECOND QUESTION: Each consumer is a different EC2 instance. I was wondering what the best approach here would be, and why.

THIRD QUESTION: If I go with the third option and want to minimize the EC2 instance requirements: installing stanza also installs pytorch (~800 MB), which seems unnecessary for this case. Is there a way to install a "serverless" stanza?

Thanks if anyone is able to clarify this for me. Sorry again if this is not the right place, or if these questions have already been answered.

AngledLuffa commented 4 years ago

Whether to use stanza or CoreNLP depends on your requirements. Stanza's models are generally more accurate, but also more expensive to run. For some tasks the model performance won't even be much different. For example, English tokenization is very well tuned and extremely fast in CoreNLP, and POS tagging is not significantly different, but dependency parsing and NER are better with stanza. Stanza also covers a much wider variety of languages. Other tools, such as constituency parsing and coref, only exist in CoreNLP.
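The trade-offs above can be sketched as a small decision helper. This is hypothetical code, not part of stanza; it just encodes the advice in this comment as simple rules:

```python
def choose_backend(annotators, large_corpus=False, has_gpu=False):
    """Suggest 'stanza' or 'corenlp' following the trade-offs described above.

    Heuristic only; the rules are an interpretation of this thread,
    not official guidance from the stanza maintainers.
    """
    # Annotators that (at the time of this thread) only exist in CoreNLP:
    # constituency parsing, coref, and the TokensRegex family.
    corenlp_only = {'parse', 'coref', 'tokensregex', 'tokensregexnq', 'tickerregex'}
    if corenlp_only & set(annotators):
        return 'corenlp'
    # The neural pipeline on a CPU is slow for large corpora.
    if large_corpus and not has_gpu:
        return 'corenlp'
    # Otherwise prefer the (generally more accurate) neural models.
    return 'stanza'
```

For example, a pipeline that needs coref always lands on CoreNLP, while a plain tokenize/POS/NER pipeline on a GPU machine lands on stanza.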

We can't advise on which of the three options is best for your setup. However, you do bring up an interesting point: our supported Python interface doesn't have to be part of stanza. We'll discuss this possibility internally. In the meantime, you can pass `--no-deps` when installing with pip so that pytorch is not pulled in. You may then need to manually install some of the other required dependencies, such as protobuf and six.
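Concretely, the install sequence would look something like this (pip's flag for skipping dependencies is `--no-deps`; which extra packages you need to add back varies by stanza version, so treat the second command as an assumption to verify against stanza's `setup.py`):

```shell
# Install stanza without pulling in its dependencies (notably pytorch, ~800 MB)
pip install --no-deps stanza

# Add back the lightweight dependencies the client code needs,
# e.g. protobuf and six as mentioned above; check stanza's setup.py
# for the full list on your version.
pip install protobuf six requests
```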

yuhui-zh15 commented 4 years ago

One point to mention: if you want to use the stanza neural pipeline to annotate large-scale corpora (e.g., more than 500 MB), it would be much better if you could find a GPU machine. Otherwise, we recommend using CoreNLP, as it will be much faster on a CPU.

malfonso0 commented 4 years ago

thanks both for your answers.

Right now, for both, I'm just calling the tokensregex method each time inside a for loop. For 1) I have read that I should concatenate the texts with a double \n to separate sentences, but I don't know if this is the best approach. For 2) I read that in Java there is a tokensregex.matcher.MultiMatch, but I could not find examples in Python.

any suggestions?

thanks again

AngledLuffa commented 4 years ago

For the same-regex, multiple-texts scenario: depending on how much text you're using, there is a startup cost for some of the processes involved. If you are making many queries, you will get better speed by combining them into one query (or at least by keeping the Java instance alive and sending multiple queries to the same server). Are you finding something that doesn't work when separating the sentences?
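One way to combine many documents into a single query is to join them with a blank line (which, with default ssplit settings, CoreNLP treats as a paragraph break) and remember the character offset where each document starts, so that matches reported with character offsets can be mapped back to their source document. This is a hypothetical helper sketch, not stanza API; whether your tokensregex results carry character offsets depends on the output format you request from the server:

```python
from bisect import bisect_right


def combine_docs(docs, sep="\n\n"):
    """Join documents with a blank line and record where each one starts.

    Returns (combined_text, offsets), where offsets[i] is the character
    offset of docs[i] within combined_text.
    """
    offsets, pos = [], 0
    for doc in docs:
        offsets.append(pos)
        pos += len(doc) + len(sep)
    return sep.join(docs), offsets


def doc_for_offset(offsets, char_offset):
    """Map a character offset in the combined text back to a document index."""
    return bisect_right(offsets, char_offset) - 1
```

With this, you can send one big annotate/tokensregex request for the combined text and attribute each match to the original document by its offset, instead of paying the per-query cost in a for loop.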

malfonso0 commented 4 years ago

It worked when separating the sentences, but I just wanted to know if there is something better. Thanks!