stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0
9.68k stars 2.7k forks

Smaller memory footprint available? #1221

Open nlp12 opened 2 years ago

nlp12 commented 2 years ago

I am working with very large input files (100 KB to 200 KB) in Chinese (ZH) and am getting out-of-memory exceptions. Is there any way to reduce the amount of memory required to run CoreNLP in Java so that I can realistically run it without running out of memory? Smaller files (20 KB) work fine and took only 1 minute, but the bulk of what I need to run is very large files. Please help, thank you!

AngledLuffa commented 2 years ago

Can you reduce the size of the files?

Which annotators are you using? The standard set? Do you use all of the annotations?


nlp12 commented 2 years ago

Yes, that's what we did for it to work, but was wondering if there was a different fix.

Just the standard set.
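
The file-splitting workaround mentioned above can be sketched as follows. This is a minimal illustration, not part of CoreNLP: it assumes the corpus has one sentence per line, and the helper name `chunk_lines` and the 20 KB default (taken from the file size reported to work) are hypothetical choices.

```python
def chunk_lines(lines, max_bytes=20_000):
    """Group lines into chunks whose UTF-8 size stays within max_bytes.

    A single line longer than max_bytes is emitted as its own chunk.
    """
    chunk, size = [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        # Start a new chunk if adding this line would exceed the budget.
        if chunk and size + n > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += n
    if chunk:
        yield chunk


# Five short lines of 10 bytes each (three 3-byte characters plus a newline).
sample = ["一二三\n"] * 5
chunks = list(chunk_lines(sample, max_bytes=25))
print([len(c) for c in chunks])
```

Each chunk can then be written to its own file and processed separately, so that no single document handed to the pipeline exceeds the size known to fit in memory.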

AngledLuffa commented 2 years ago

You could always figure out whether there are any annotators you don't need and prune them from the list. That would be a good first step for saving memory.


nlp12 commented 2 years ago

Really, I am just looking for a constituency parser to separate the phrases in my corpus. What would you recommend as the best command for this? Thank you!

strongerfly commented 2 years ago

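This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.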

AngledLuffa commented 2 years ago

You can set annotators="tokenize,pos,parse" for that


nlp12 commented 2 years ago

Did that, but we are now running into some issues. Hoping you can assist.

  1. Is there any way to specify the level of the tree to output as tab-separated values? I want the phrases at a specified level printed together in the same row, not separated.
  2. The parse sometimes does not make sense. How can we work around this? For example, see the attachment: [image: Screen Shot 2021-12-08 at 3 42 24 PM] https://user-images.githubusercontent.com/78334409/145289099-22478572-73b0-4613-85f1-9c41ddb56dfd.png
  3. The CoNLL output is just the tokenization. How can I get specific parsed items from the command line? For example, in my attachment, I just want the two NPs "五言" and "古诗" as such, but the CoNLL output prints them as "五" "言" "古诗". Thank you!

AngledLuffa commented 2 years ago

"言 " is being used as a measure word here, right? I don't see anything wrong with the tokenization. The parse goes awry because there's no similar use of 言 as a measure word in the training data.


nlp12 commented 2 years ago

"五言" goes together, it refers to the syllable count per line of a style of poetry, nothing to do with measure word here.

The tokenization is fine. I am just looking to extract specific levels of the constituency tree from the command line so that I grab the phrases I want (e.g., dependent gloss and governor gloss together on the same row).

AngledLuffa commented 2 years ago

Alright, my Chinese is not great so I'll take your word for it. Still, those two characters as a single word do not show up in the training data either.

When you talk about dependent and governor glosses on the same line, are you actually more interested in dependency parsing or constituency parsing? There's also a Chinese dependency parser in CoreNLP (and a more accurate one in our python tool, for that matter).

nlp12 commented 2 years ago

You are right, they are not a single word, so maybe this is a bad example. Nevertheless, if "五言古诗" is split in the constituency tree from top to bottom, it's first "五言" "古诗", then "五" "言" "古诗", but not "五" "言古诗" as CoreNLP incorrectly splits it.

I have had great success with Python Stanza and CoreNLP in getting the dependency parser to split the text into words, but not into larger units, which is why I turned to the constituency parser.

However, I would like to use just the constituency parser so that I get only the larger units (syntactic constituents), displayed as tab-separated values.

My question is: how can we get the constituents (e.g., subject NP and predicate VP) printed in the CoNLL output, rather than only visible in the text output as a bracketed tree? Thank you!

AngledLuffa commented 2 years ago

That makes sense. FWIW, I think this is just a bad example of something that doesn't exist in the training set.

In terms of alternate output formats, it sounds like we don't have exactly what you're looking for, but you should be able to recreate it from the trees. Are you still working on the python side or are you using Java directly to get the constituency trees?
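
Recreating a tabular view from the bracketed trees could look roughly like the sketch below. It is an illustration under stated assumptions, not a CoreNLP or Stanza feature: `parse_tree`, `leaves`, and `constituents` are hypothetical helpers written in plain Python, and the example tree is hand-written rather than actual parser output.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]


def constituents(tree, joiner=""):
    """Yield (label, text) for every phrase-level node, top-down."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return                        # POS tag directly over a word: not a phrase
    yield label, joiner.join(leaves(tree))
    for child in children:
        if not isinstance(child, str):
            yield from constituents(child, joiner)


# Hand-written illustrative tree, NOT actual parser output.
example = "(ROOT (NP (NP (CD 五) (NN 言)) (NP (NN 古诗))))"
rows = list(constituents(parse_tree(example)))
for label, text in rows:
    print(f"{label}\t{text}")
```

Words are joined with an empty string because Chinese text has no spaces between tokens; pass `joiner=" "` for English.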


nlp12 commented 2 years ago

We use both, but for the constituency trees we are using Java. What do you suggest we do so that we can specify which levels of the constituency tree appear in the output as tab-separated values?

AngledLuffa commented 2 years ago

The limitation here is that we don't know exactly what you're looking for. Are you only looking for the lowest NP, VP, etc nodes? In general, these nodes are nested (some brackets removed for readability):

(NP more mail (PP than (NP (NP almost any other issue) (PP in memory))))

So, do you want all 3 NPs, or just the smallest one, or the biggest one, or ...?
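
To make the three options concrete, here is a minimal plain-Python sketch that lists every NP in the simplified tree above and flags the outermost and innermost ones. The helper names (`parse_tree`, `leaves`, `spans`) are hypothetical and are not part of CoreNLP, Stanza, or tregex.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]


def spans(tree, label, under=False):
    """Yield (text, is_outermost, is_innermost) for each node carrying `label`."""
    name, children = tree
    subtrees = [c for c in children if not isinstance(c, str)]
    # Matching nodes found anywhere below this node.
    nested = [s for c in subtrees for s in spans(c, label, under or name == label)]
    if name == label:
        yield " ".join(leaves(tree)), not under, not nested
    yield from nested


tree = parse_tree("(NP more mail (PP than (NP (NP almost any other issue) (PP in memory))))")
nps = list(spans(tree, "NP"))
for text, outermost, innermost in nps:
    print(f"{text}\toutermost={outermost}\tinnermost={innermost}")
```

Filtering on the second or third field gives the biggest or the smallest NP; keeping every row gives all three.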

Another thought is, we can provide a more accurate Chinese model in Stanza (python) now, or we can provide an even more accurate one if you're okay with using Bert as part of a Stanza model. If that would be useful, let me know


nlp12 commented 2 years ago

I am wondering whether we can specify the level of the constituency parse we want so that the CoNLL output represents that level. For instance, in the example below, we would specify the VP 无法集中 rather than what is printed (separated out).

[image: Screen Shot 2021-12-13 at 11 46 56 AM]

Yes, that would be useful to try out. Please let me know where to go and what to do to download this. Thank you!

AngledLuffa commented 2 years ago

You would need to programmatically identify the level of the tree you want to access. In both Java and Python the tree is an interface with the label of the node and a list of children. In Java there is also a "tregex" tool, which is similar to regex except that it applies to trees. We'll eventually provide an interface for using the tregex search with Python-produced trees, but we haven't done that yet.

If you can explain how to choose the specific VP you want, I can try to answer how to find it by searching over the tree...
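
One possible reading of "level" is depth below the root. Under that assumption, cutting the tree at a fixed depth and printing the resulting phrases on one tab-separated row can be sketched in plain Python as below. This is not tregex and not a CoreNLP output format; `parse_tree`, `leaves`, and `cut` are hypothetical helpers, and the example tree is hand-written rather than actual parser output.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]


def cut(tree, depth):
    """Return the phrases obtained by cutting the tree `depth` levels below this node."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    # Stop at the requested depth, or earlier if this node sits directly over words.
    if depth == 0 or all(isinstance(c, str) for c in children):
        return ["".join(leaves(tree))]
    return [phrase for child in children for phrase in cut(child, depth - 1)]


# Hand-written illustrative tree, NOT actual parser output.
tree = parse_tree("(ROOT (IP (NP (PN 我)) (VP (VV 无法) (VP (VV 集中)))))")
for depth in range(4):
    print("\t".join(cut(tree, depth)))
```

Branches that bottom out before the requested depth are kept whole, so every row still covers the full sentence.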

As for stanza ZH conparser, you would just need to install the dev branch:

pip install git+git:// @.***

stanza.download("zh")

Then the ZH pipeline should have a conparser in it. There's a more accurate one using Bert (specifically, the WWM model from HIT), which has a word length limit... we haven't implemented a way around that yet, unfortunately. You can download it with

stanza.download("zh", package="ctb-inorder-bert")

nlp12 commented 2 years ago

And where can we find more info on how to programmatically specify the level(s) of the tree we want (tab-separated or line-separated) in Java? Is it a matter of figuring out the equivalent of the level in "tregex" and then grabbing that?

What we are looking to do is this: since every line in our corpus is a sentence, we want to grab the larger phrases in each line and separate them onto their own lines.

Thank you for letting me know. Word limit is an issue for me, so I will avoid Bert for now.

AngledLuffa commented 2 years ago

> And where can we go to find more info on how to programmatically specify level(s) of tree we want

I don't have a clear picture in mind yet of how you plan on determining this. The example you posted earlier has 3 separate VPs, 无法集中, 集中, and basically the entire sentence. How do you want to distinguish those?

> Thank you for letting me know. Word limit is an issue for me, so I will avoid Bert for now.

The word limit is quite long, well over 100 words, so it might not be an issue for you.

nlp12 commented 2 years ago

Your constituency parse is separating out those higher-level phrases. How can we grab those parsed phrases and include them in our line-separated/tab-separated values? Or is there no way to convert the constituency parse to CoNLL?