stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Distributed Processing with Stanza CoreNLP Interface #720

Open aicaffeinelife opened 3 years ago

aicaffeinelife commented 3 years ago

Hi,

I've really been liking how Stanza just "works" out of the box for the last month or so. However, I have recently hit a wall, and the documentation on the Stanza CoreNLP client is a little sparse. The problem is this: I want to extract relations from a large collection of text (~90k sentences on average). Doing this sequentially on a single machine would be prohibitively time consuming, so I want to develop a distributed interface that can run the extractor on separate cores with different chunks of the data.

I detail my current approach and issues below:

On a machine, I start a java server with the following command:

java -Xmx5G -cp "/path/to/stanza_corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 \
-timeout 60000 \
-threads 8 \
-maxCharLength 100000 \
-quiet False \
-annotators openie -preload -outputFormat serialized

Upon inspecting the logs, I find this message: java.lang.IllegalArgumentException: annotator "openie" requires annotation "IndexAnnotation", which apparently means all the other annotators need to be loaded as well.

On the client side, I have implemented a rudimentary solution like this:

import pickle

from stanza.server import CoreNLPClient
from joblib import Parallel, delayed

def chunker(iterable, chunk_size):
    return (
        iterable[pos : pos + chunk_size]
        for pos in range(0, len(iterable), chunk_size)
    )

def worker_fn(client, batch):
    annotated = []
    for txt in batch:
        ann = client.annotate(txt)
        annotated.append(ann)
    return annotated

# the server endpoint is the default localhost:9000
client = CoreNLPClient(annotators=["openie"], start_server=False)

# files is a list of Path objects pointing at the pickled sentence data
data = pickle.load(files[0].open('rb'))
texts = [d['sent_text'] for d in data]

tasks = (delayed(worker_fn)(client, chunk) for chunk in chunker(texts, 100))
result = Parallel(n_jobs=8, backend='multiprocessing', prefer='processes')(tasks)
client.stop()

Doing this results in a MaybeEncodingError: the payload being passed between processes is the fully annotated (tokenized, POS-tagged) version of each chunk of sentence text, whereas I only intend to pass around the annotation triples.
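
Ideally, each worker would return only the extracted triples rather than the full serialized document, something like this (an untested sketch on my end; I'm assuming the subject/relation/object fields of openieTriple in the protobuf Document returned by annotate are the right things to read):

def worker_fn(client, batch):
    triples = []
    for txt in batch:
        ann = client.annotate(txt)  # full protobuf Document from the server
        for sentence in ann.sentence:
            for t in sentence.openieTriple:
                # keep only plain strings so pickling between processes stays cheap
                triples.append((t.subject, t.relation, t.object))
    return triples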

My questions then are:

  1. Can multiprocessing be implemented using the stanza.server interface?
  2. How can we make it work with parallel libraries like multiprocessing or joblib?

I look forward to your suggestions, hints, etc.

Thanks

AngledLuffa commented 3 years ago

Are you looking to distribute your work to multiple server machines or just have the one server machine do all the work in multiple threads? The Java server already has multithreading capabilities. This option is in the documentation for "server start options", so while we're always happy to get suggestions on how to improve the documentation, I don't think anything was missing.

https://stanfordnlp.github.io/stanza/client_properties.html#corenlp-server-start-options-server

You are likely to hit a wall with how much multithreading helps, as often happens in java programs, but you can try increasing it beyond -threads=5.
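
If all the work stays on one machine, another option is to leave the parallelism to the server and simply fire requests from a Python thread pool, since annotate() spends most of its time waiting on the server anyway. A rough sketch, with one client per thread to stay on the safe side, and assuming the server was started with openie and its prerequisite annotators:

from concurrent.futures import ThreadPoolExecutor
from stanza.server import CoreNLPClient

def annotate_batch(batch):
    # each thread gets its own client, all pointing at the already-running
    # server on the default localhost:9000
    with CoreNLPClient(annotators=["openie"], start_server=False) as client:
        return [client.annotate(txt) for txt in batch]

# texts is the list of sentences from your snippet above
batches = [texts[i:i + 100] for i in range(0, len(texts), 100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(annotate_batch, batches))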

You mention separate "cores" which is why I ask if you want a single machine doing the work. You also mention possibly using multiprocessing on the client side, though, which makes me think you may want to distribute the work to multiple machines. The easiest way to do that would be to split the input into multiple files and save the results for later. There's currently no plan to add a multi-server client ourselves, although we generally try to respond to pull requests. You could also leave this issue open and maybe one day we'll do it ourselves, although that requires someone being inspired to do so.

Unfortunately there is no way to only send the openie results back across the wire. You can make such a thing yourself by changing edu/stanford/nlp/pipeline/ProtobufAnnotationSerializer.java and then recompiling the server. Generally I would expect that the programmer time is not worth the savings in bandwidth, but again, if you make such a thing an option in ProtobufAnnotationSerializer and send it back via pull request, it might be useful for other people in the future as well.

aicaffeinelife commented 3 years ago

Well, the current problem was more focused on developing a sub-module that first works on a single machine and then can be generalized to multiple machines. By cores, I meant cores on a single machine.

I understand that the multi-server client isn't a priority, but I'm still curious about why I can't run the openie annotator standalone. When I start the server and client like this:

annos = []
# sents is a list of sentence strings
with CoreNLPClient(annotators=["openie"]) as client:
    for sent in sents:
        ann = client.annotate(sent)
        annos.append(ann)

I'm able to get the openie predictions back (since start_server is enabled). But when I start my own server and instantiate a client (start_server = False), the error message pops up in the server logs. I just reverified the command that's generated by the CoreNLPClient with start_server enabled, and it only differs in the reference to a temp property file. Am I doing something obviously wrong here?

Re: the serialization and a multi-machine server/client, I do think it would be a useful enhancement, since Stanza has the potential to be production grade out of the box with that ability. I'd like to leave this issue open in the interest of the community.

AngledLuffa commented 3 years ago

I'm still curious about why I can't run the openie annotator standalone.

Either way, the requirements are the same. The Python client does its best to fill in the missing requirements; the standalone Java server is far less helpful... sorry! Hopefully it at least tells you what the missing requirements are. For example, it's telling me

[main] ERROR CoreNLP - java.lang.IllegalArgumentException: annotator "openie" requires annotation "IndexAnnotation". The usual requirements for this annotator are: tokenize,ssplit,pos,lemma,depparse,natlog

so that tells me what additional annotators I need to add.
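
Concretely, adding that list in front of openie in your -annotators flag should make the standalone server behave the same way the Python-launched one does. On the client side that would look something like this (just a sketch; the annotator list is taken from the error message above):

from stanza.server import CoreNLPClient

# openie plus its usual prerequisites, per the error message
ANNOTATORS = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]

# start_server=False: talk to the already-running server on localhost:9000,
# which should have been launched with the same annotator list
with CoreNLPClient(annotators=ANNOTATORS, start_server=False) as client:
    ann = client.annotate("Stanford is a university in California.")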


aicaffeinelife commented 3 years ago

Ah okay, thanks! That was a little confusing.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically closed due to inactivity.

henrytseng commented 2 years ago

I noticed that this issue became stale after some time. I'd like to ask whether there's a chance it might be reopened. In some instances a server might have multiple GPUs - it would be very useful to be able to control how processing is distributed across them. Are there any future plans for this?

In addition, GPU utilization in our setup appears to be only around 17%.
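
For context, what we are after is roughly one worker process per GPU, each with its own pipeline. A rough, untested sketch of the kind of thing we have in mind, pinning each process to a GPU via CUDA_VISIBLE_DEVICES (the per-GPU input lists and output paths are placeholders):

import os
from multiprocessing import Process

def run_on_gpu(gpu_id, texts, out_path):
    # pin this worker to a single GPU before the pipeline loads its models
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import stanza
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
    docs = [nlp(t) for t in texts]
    # in a real run, the processed docs would be written to out_path here

if __name__ == "__main__":
    chunks = [texts_for_gpu0, texts_for_gpu1]  # placeholder: pre-split input, one chunk per GPU
    procs = [Process(target=run_on_gpu, args=(i, chunk, f"out_{i}.pkl"))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()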

AngledLuffa commented 2 years ago

Happy to take suggestions on increasing the 17% :/

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.