stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Using multiple models in NER #1334

Open linlinloo opened 9 months ago

linlinloo commented 9 months ago

I want to run the following code, but an error occurred.

    import stanza
    pipe = stanza.Pipeline("en", processors="tokenize,ner", package={"ner": ["ncbi_disease", "ontonotes"]})
    doc = pipe("John Bauer works at Stanford and has hip arthritis. He works for Chris Manning")
    print(doc.ents)

WARNING: Language en package default expects mwt, which has been added

I have downloaded ncbi_disease.pt and placed it in site-packages\stanza\stanza_resources\en\ner. What is the problem, and why does it happen?

AngledLuffa commented 9 months ago

That's not an error, though. It should work just fine with that warning.

linlinloo commented 9 months ago

However, running it did not yield any results, only a series of errors: ConnectTimeout, MaxRetryError, and so on. When I run other code, there is no ncbi_disease among the NER models. Did I put the wrong package there?

Loading these models for language: en (English):

| Processor    | Package                   |
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| sentiment    | sstplus                   |
| ner          | ontonotes-ww-multi_charlm |

AngledLuffa commented 9 months ago

If it's giving a timeout error, I would guess the most likely culprit is that it's trying to download missing resources and isn't able to connect. You can add download_method=None to the Pipeline to stop it from downloading.
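As a rough illustration of the suggestion above, here is a minimal sketch of building a Pipeline with downloads disabled. The helper names `offline_pipeline_kwargs` and `build_offline_pipeline` are hypothetical and not part of Stanza. The sketch assumes that every requested model, plus any supporting files it depends on, is already in the local resources directory.

```python
def offline_pipeline_kwargs(lang="en", processors="tokenize,ner", ner_packages=None):
    """Build keyword arguments for stanza.Pipeline with downloading disabled.

    Hypothetical helper for illustration; not part of Stanza itself.
    """
    kwargs = {
        "lang": lang,
        "processors": processors,
        # None asks Stanza not to download anything, so every model
        # must already exist in the local resources directory.
        "download_method": None,
    }
    if ner_packages:
        # Same shape as the package argument used earlier in this thread.
        kwargs["package"] = {"ner": list(ner_packages)}
    return kwargs


def build_offline_pipeline(**options):
    """Create the Pipeline; stanza is imported lazily, only when this is called."""
    import stanza

    return stanza.Pipeline(**offline_pipeline_kwargs(**options))
```

Calling `build_offline_pipeline(ner_packages=["ncbi_disease", "ontonotes-ww-multi_charlm"])` would then either load the local models or fail quickly with a missing-file error instead of waiting on a network timeout.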

AngledLuffa commented 9 months ago

Also, I should note that for version 1.7.0, the default NER model is now "ontonotes-ww-multi_charlm". There's also "ontonotes_charlm". They are named this way so that you can get "nocharlm" models if you want faster processing. If there's some stale documentation, please let me know and I'll update it.
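Because the package names changed, a name such as "ontonotes" may no longer match any file on disk. The following is a small sketch for checking which requested packages are missing locally before building a Pipeline. The helper `missing_models` is hypothetical. The layout `<resources_dir>/<lang>/<processor>/<package>.pt` is an assumption based on the path mentioned earlier in this thread, and the resources directory itself has to be supplied by the caller.

```python
from pathlib import Path


def missing_models(resources_dir, lang, processor, packages):
    """Return the package names that have no <package>.pt file on disk.

    Hypothetical helper; assumes models are stored as
    <resources_dir>/<lang>/<processor>/<package>.pt.
    """
    model_dir = Path(resources_dir) / lang / processor
    return [name for name in packages if not (model_dir / f"{name}.pt").is_file()]
```

For example, `missing_models(my_resources_dir, "en", "ner", ["ncbi_disease", "ontonotes-ww-multi_charlm"])` returns the names Stanza would otherwise try to download, where `my_resources_dir` stands for whatever directory your installation actually uses.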

linlinloo commented 9 months ago

I found ontonotes_charlm.pt, and I can download it. Do you mean that I should replace ontonotes-ww-multi_charlm with it? And sorry, how do I add download_method=None? Like this? pipe = stanza.Pipeline("en", download_method=None)

AngledLuffa commented 9 months ago

> I found ontonotes_charlm.pt, and I can download it. Do you mean that I should replace ontonotes-ww-multi_charlm with it?

You can do whatever you like, of course. The ww-multi model was trained on both OntoNotes and the dataset described in this paper.

> And sorry, how do I add download_method=None? Like this? pipe = stanza.Pipeline("en", download_method=None)

Yes, exactly. I suggest that because it's the most likely reason you're getting timeouts. If the problem is somewhere else, please include the complete stack trace.
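If the timeouts persist, one way to collect the complete stack trace for a report is sketched below. The wrapper `call_with_trace` is a hypothetical helper that uses only the standard library. It is a generic pattern, not something Stanza provides.

```python
import traceback


def call_with_trace(fn, *args, **kwargs):
    """Call fn; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() returns the full traceback as a string,
        # including any chained exceptions.
        return None, traceback.format_exc()
```

Wrapping the Pipeline construction, for example `pipe, trace = call_with_trace(stanza.Pipeline, "en", download_method=None)`, gives a `trace` string that can be pasted into the issue when `pipe` is None.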