stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Request for a standalone CoreNLP client package #1192

Open tanloong opened 1 year ago

tanloong commented 1 year ago

Stanza works both as an NLP toolkit in its own right and as the official CoreNLP client, with much more functionality than other CoreNLP wrappers.

In the latter case, users who want only the client, when installing Stanza with pip, still have to install the toolkit's dependencies, such as the 800 MB+ PyTorch. Installing these dependencies means a long wait that is probably unnecessary for them.

Can we have a lightweight standalone package, named stanza-corenlp for example, as a choice for those who rely only on Stanza's CoreNLP-accessing capability?

AngledLuffa commented 1 year ago

I think this is something we could do with the existing Stanza, actually. We just need to figure out how to change setup.py so that there's an option to install without pytorch, and so that the module itself doesn't automatically load pytorch when you import stanza. It's not a super high priority, but I do see the value in such an update.
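
For illustration, a minimal sketch of what that setup.py option could look like, assuming the heavy dependencies are moved into an extra (the extra name and the dependency split below are assumptions, not the actual Stanza packaging):

    # Hypothetical setup.py sketch: torch as an optional extra.
    # The dependency split here is an assumption for illustration.
    from setuptools import setup, find_packages

    setup(
        name='stanza',
        packages=find_packages(),
        # client-only dependencies
        install_requires=['protobuf', 'requests'],
        # `pip install stanza[full]` would pull in the model stack
        extras_require={'full': ['torch>=1.3.0', 'numpy', 'emoji', 'tqdm']},
    )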

tanloong commented 1 year ago

Oh, that is a more reasonable plan!

I am working on a Python project related to Stanford Parser and Tregex, and I was trying to move to the CoreNLP server (through Stanza's server module) for faster batch processing. But the issue described in the initial comment above dashed my hope that users of this project could have a fast install experience.

The reason I closed this issue just now is that I found a workaround to get faster batch processing for that project. With JPype1, it is possible to keep Stanford Parser's LexicalizedParser in a Python variable and call its parseTree() method as many times as needed, and installing JPype1 is fast. Indeed, I had tried the CoreNLP server with a full install of Stanza and found that the server seemed to load annotators for every parsing request, making it not significantly faster than Stanford Parser.
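
For anyone curious, here is a minimal sketch of that JPype1 workaround, assuming pre-tokenized input; the jar paths are placeholders for a local Stanford Parser installation:

    import jpype
    import jpype.imports

    # Start one JVM per process; jar paths are placeholders.
    jpype.startJVM(classpath=['stanford-parser.jar',
                              'stanford-parser-models.jar'])

    from java.util import ArrayList
    from edu.stanford.nlp.ling import Word
    from edu.stanford.nlp.parser.lexparser import LexicalizedParser

    # Load the model once and keep it in a Python variable; repeated
    # parseTree() calls reuse the already-loaded grammar.
    parser = LexicalizedParser.loadModel(
        'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

    def parse(tokens):
        # Wrap pre-tokenized words as Java Word objects and parse.
        words = ArrayList()
        for t in tokens:
            words.add(Word(t))
        return parser.parseTree(words)

    tree = parse(['This', 'is', 'a', 'test', '.'])
    print(tree.pennString())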

However, I think this issue may still be interesting for some others.

Thanks for your kind reply😄.

AngledLuffa commented 1 year ago

Well, that's excellent. A couple thoughts:

tanloong commented 1 year ago

(This comment is not related to the issue. Skip it.)

That's so great! The project I mentioned has been using Stanford Parser as the constituency parser to keep its output consistent with that of another project from which it originated. Maybe such consistency should not take priority over accuracy.

You are right, sorry for the careless mistake. I retried the CoreNLP (v4.5.1) client and found that it does keep the loaded annotators, whether used as a context manager or through a variable holding the CoreNLPClient object. I had probably confused the initial loading with subsequent parsing-time loading, since the client seems to print the loading logs twice.

[screenshot: 2023-02-13_20-12.png, client loading logs]
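
For reference, a minimal sketch of reusing one client across requests via Stanza's documented server interface (the annotator list and memory setting below are arbitrary choices):

    from stanza.server import CoreNLPClient

    # The context manager starts one server; the annotators are loaded
    # once at startup and reused for every annotate() call.
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'parse'],
                       timeout=30000, memory='4G') as client:
        for text in ['The server loads models once.',
                     'Later requests reuse them.']:
            ann = client.annotate(text)
            # each sentence carries its constituency parse
            print(ann.sentence[0].parseTree)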

AngledLuffa commented 1 year ago

It's kind of funny, actually, but once upon a time the models and the client were two separate modules. Then they were merged to create Stanza. As it turns out, there's really not much overlap. There are some data structures which are shared, though, such as a tree structure used to represent a constituency tree. That overlap makes it a little annoying to separate them right down the middle.

The issue with making a lightweight version is that the Pipeline is automatically imported in __init__.py, which pulls in all the processors, which in turn pull in torch and all of the other components. Perhaps the easiest way to cut it off is to wrap this block in a try/except ImportError:

from stanza.pipeline.core import DownloadMethod, Pipeline
from stanza.pipeline.multilingual import MultilingualPipeline
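
In other words, something like this sketch of the guard in stanza/__init__.py (the fallback behavior here is one possible choice, not a committed design):

    # stanza/__init__.py -- sketch of the suggested guard
    try:
        from stanza.pipeline.core import DownloadMethod, Pipeline
        from stanza.pipeline.multilingual import MultilingualPipeline
    except ImportError:
        # torch (pulled in via the processors) is not installed;
        # only the CoreNLP client functionality remains available.
        pass
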
stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

talw-nym commented 1 month ago

Is there current work on this?

AngledLuffa commented 1 month ago

Nothing is planned at this time