tanloong opened this issue 1 year ago
I think this is something we could do with the existing Stanza, actually. We just need to figure out how to change `setup.py` so that there's an option to install without pytorch, and so that the module itself doesn't automatically load pytorch when you `import stanza`. It's not a super high priority, but I do see the value in such an update.
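A rough sketch of what that option could look like in `setup.py`, keeping the client's dependencies in `install_requires` and moving torch behind an optional extra (the dependency lists here are illustrative, not Stanza's actual ones):

```python
# Hypothetical packaging sketch: `pip install stanza` stays light,
# `pip install stanza[pipeline]` pulls in the full NLP toolkit.
from setuptools import setup, find_packages

setup(
    name="stanza",
    packages=find_packages(),
    install_requires=["protobuf", "requests"],        # enough for the CoreNLP client
    extras_require={"pipeline": ["torch", "numpy"]},  # full neural pipeline
)
```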
Oh, that is a more reasonable plan!
I am working on a Python project built around Stanford Parser and Tregex, and I was trying to move to the CoreNLP server (through Stanza's server module) for faster batch processing. But the issue described in the initial comment above means that users of this project cannot have a fast install experience.
The reason I closed this issue just now is that I found a workaround that gives that project faster batch processing. With `JPype1`, it is possible to save the `LexicalizedParser` of Stanford Parser in a Python variable and call its `parseTree()` method as many times as needed, and installing `JPype1` is fast. I did try the CoreNLP server with a full install of Stanza, but the server seemed to load annotators for every parsing request, making it not significantly faster than Stanford Parser.
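For reference, a minimal sketch of that workaround (jar names and the model path are illustrative and depend on your Stanford Parser download):

```python
# Load the Java LexicalizedParser once via JPype1, then reuse it.
import jpype
import jpype.imports

jpype.startJVM(classpath=["stanford-parser.jar", "stanford-parser-models.jar"])

from edu.stanford.nlp.ling import Sentence
from edu.stanford.nlp.parser.lexparser import LexicalizedParser

# Load the model once and keep it in a Python variable ...
parser = LexicalizedParser.loadModel(
    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

# ... then call parseTree() as many times as needed without reloading.
for tokens in [["The", "dog", "barks", "."], ["It", "works", "."]]:
    tree = parser.parseTree(Sentence.toWordList(*tokens))
    print(tree.pennString())
```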
However, I think this issue may still be of interest to others.
Thanks for your kind reply😄.
Well, that's excellent. A couple thoughts:
1. The CoreNLP server should not be reloading its annotators on every request; once the client is started, the models stay in memory, so the slowdown you saw may have another cause.
2. If the project is using the old Java Stanford Parser for constituency parsing, note that it is noticeably less accurate than the newer neural constituency parser, so it may be worth weighing accuracy against output consistency.
That's so great! The project I mentioned has been using Stanford Parser as its constituency parser to keep its output consistent with that of another project from which it originated. But maybe such consistency should not take priority over accuracy.
You are right, sorry for the careless mistake. I retried the CoreNLP (v4.5.1) client and found that it does keep its annotators loaded, whether it is used as a context manager or stored in a variable holding the `CoreNLPClient` object. I had probably confused the initial loading with subsequent parsing-time loading, as the client seems to print its loading logs twice.
It's kind of funny, actually, but once upon a time the models and the client were two separate modules. Then they were merged to create Stanza. As it turns out, there's really not much overlap. There are some data structures which are shared, though, such as a tree structure used to represent a constituency tree. That overlap makes it a little annoying to separate them right down the middle.
The issue with making a lightweight version is that the Pipeline is automatically imported in `__init__.py`, which pulls in all the processors, which pull in torch and all of the other components. Perhaps the easiest way to cut it off is to wrap this block in a `try/except ImportError`:
```python
from stanza.pipeline.core import DownloadMethod, Pipeline
from stanza.pipeline.multilingual import MultilingualPipeline
```
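Something like the following sketch (the `None` fallback is just one possible behavior):

```python
# Sketch of the suggested change in stanza/__init__.py: tolerate a missing
# torch so that a client-only install can still `import stanza`.
try:
    from stanza.pipeline.core import DownloadMethod, Pipeline
    from stanza.pipeline.multilingual import MultilingualPipeline
except ImportError:
    # torch (or another pipeline dependency) is not installed;
    # only the CoreNLP client under stanza.server remains usable.
    DownloadMethod = Pipeline = MultilingualPipeline = None
```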
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there current work on this?
Nothing is planned at this time
Stanza works both as an NLP toolkit in its own right and as the official CoreNLP client, with much more functionality than other CoreNLP wrappers. Users who want only the latter, when installing Stanza with `pip`, have to install dependencies for the former, such as the 800 MB+ `pytorch`. Installing these dependencies means a long wait that is probably unnecessary for them. Can we have a light standalone package, named `stanza-corenlp` for example, as a choice for those who rely only on the CoreNLP-accessing capability of Stanza?