Closed: malfonso0 closed this issue 4 years ago
Whether to use stanza or corenlp depends on your requirements. Stanza's models are generally better, but more expensive to run. For some tasks the performance difference is small: English tokenization, for example, is very well tuned and extremely fast in corenlp, and POS tagging is not significantly different. Dependency parsing and NER, however, are better with stanza, and stanza supports a much wider variety of languages. Other tools, such as constituency parsing and coref, only exist in corenlp.
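The trade-offs above can be sketched as a small chooser function. This is illustrative only: the function name and the `CORENLP_ONLY` set are mine, not part of stanza or CoreNLP.

```python
# Illustrative only: encode the rule of thumb above as a toolkit chooser.
# The names here are mine, not part of stanza or CoreNLP.
CORENLP_ONLY = {"parse", "coref"}  # constituency parsing and coref exist only in CoreNLP

def choose_toolkit(annotators):
    """Return 'corenlp' if any requested annotator exists only there;
    otherwise prefer 'stanza' (generally stronger depparse/NER models,
    many more supported languages)."""
    if CORENLP_ONLY & set(annotators):
        return "corenlp"
    return "stanza"

print(choose_toolkit(["tokenize", "pos", "lemma", "ner"]))       # stanza
print(choose_toolkit(["tokenize", "ssplit", "parse", "coref"]))  # corenlp
```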
We can't advise on which of the three options is best for your setup. However, you do bring up an interesting point: our supported Python interface doesn't have to be part of stanza, and we'll discuss this possibility internally. In the meantime, you can pass --no-dependencies when installing with pip so that PyTorch is not included. You may then need to manually install some of the other required dependencies, such as protobuf and six.
One point to mention: if you want to use the stanza neural pipeline to annotate large-scale corpora (e.g., more than 500MB), it would be much better if you could find a GPU machine. Otherwise, we recommend using CoreNLP, as it will be much faster on the CPU.
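A minimal sketch of that decision rule, with a best-effort GPU check; the helper names and the way the ~500 MB threshold is applied are mine, not from any stanza API:

```python
def gpu_is_available():
    """Best-effort GPU check; returns False when PyTorch is not installed."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

def pick_backend(corpus_mb, gpu_available):
    """Follow the advice above: for large corpora (> ~500 MB) without a GPU,
    fall back to CoreNLP, which is much faster on CPU; otherwise use stanza."""
    if corpus_mb > 500 and not gpu_available:
        return "corenlp"
    return "stanza"

print(pick_backend(2000, gpu_is_available()))
```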
Thanks to you both for your answers.
So stanza right now, in my understanding, is intended for local processing. Are you planning to enable a client-server-like infrastructure? I'm always thinking about multi-machine performance. Of course, it probably depends on the task: if it's mostly NLP work, local is probably better, but if the NLP processing is just one step, client-server may do better.
On another, unrelated question: I'm doing a lot of TokensRegex matches over the annotated text, and I'm searching for a faster approach to two things: 1) running the same pattern over many texts, and 2) running many patterns over the same text.
Right now, for both, I'm just calling the tokensregex method each time inside a for loop. For 1), I have read that I should concatenate the texts with a double \n to separate the sentences, but I don't know if this is the best approach. For 2), I read that in Java there is a tokensregex.matcher.MultiMatch, but I could not find examples in Python.
Any suggestions?
Thanks again.
For the same-regex, multiple-texts scenario: depending on how much text you're processing, there is a startup cost for some of the processes involved. If you are making many queries, you will get better speed by combining them into one query (or at least keeping the Java instance alive and sending multiple queries to the same server). Are you finding something that doesn't work when separating the sentences?
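Both suggestions can be combined in a short sketch, assuming stanza is installed and CoreNLP is available locally (the helper names are mine): documents are packed into one query with blank-line separators, and a single long-lived client handles every TokensRegex call.

```python
def pack_documents(texts):
    """Join documents with a blank line so that ssplit keeps their sentences
    apart; one annotate/tokensregex call then covers many texts."""
    return "\n\n".join(texts)

def match_all(texts, patterns):
    """Run every TokensRegex pattern against one long-lived CoreNLP server,
    paying the Java startup cost once instead of once per query.
    Requires stanza plus a local CoreNLP installation."""
    from stanza.server import CoreNLPClient  # lazy import: optional dependency
    batched = pack_documents(texts)
    results = []
    with CoreNLPClient(annotators=["tokenize", "ssplit"], be_quiet=True) as client:
        for pattern in patterns:
            results.append(client.tokensregex(batched, pattern))
    return results
```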
Separating the sentences worked; I just wanted to know if there is something better. Thanks.
Hi, and first of all, thanks in advance, and sorry if this is an old question or out of place. I actually have three questions. I will post all three here, but maybe I should put them in different posts.

CONTEXT: I'm developing a small system where I need to run an NLP process over a document. My architecture is based on AWS and is, more or less, as follows: an API receives a file and leaves it in a place; a "lot" of consumers are waiting (in order to have high throughput), and finally only one reads THAT file, processes it, and leaves the result in another place. This is already done (with a local NLP server in each consumer). For the annotators, I mostly use ['tokenize','pos','lemma','ner'], and for some documents ['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref','tokensregex', 'tokensregexnq', 'tickerregex']. I also do a lot of TokensRegex searches.
FIRST QUESTION: I am quite new and don't yet understand the difference between using a stanza.Pipeline vs the server.CoreNLPClient.
SECOND QUESTION: EACH consumer is a different EC2 instance. Now I was wondering what the best approach would be, and why.
THIRD QUESTION: If going with the third option, and wanting to minimize the EC2 instance requirements: when installing stanza, it also installs PyTorch (800MB), which seems unnecessary for this case. Is there a way to install a "serverless" stanza?
Thanks if anyone is able to clarify this for me. Sorry again if this is not the place, or if the questions have already been answered.