stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

coref not using proper noun #1326

Closed. fakerybakery closed this issue 5 months ago.

fakerybakery commented 9 months ago

Hi, thanks for this tool. I noticed that coref sometimes doesn't use the proper noun as the representative text for a chain. Is there any way to make it use the proper noun? Here is my code (WIP):

import stanza

# Tokenize and run coreference resolution with the English pipeline.
pipe = stanza.Pipeline("en", processors="tokenize,coref")
t = pipe('"I am doing this," John said. He did it.')

final = []
for sente in t.to_dict():
    sent = []
    exclude_ids = []  # word ids already covered by a multi-word token
    for word in sente:
        if word['id'] in exclude_ids:
            continue
        # A tuple id marks a multi-word token; skip the words it spans.
        if isinstance(word['id'], tuple):
            exclude_ids += word['id']
        chains = word.get('coref_chains')
        # Replace a non-representative mention with its chain's representative text.
        if isinstance(chains, list) and chains and not chains[0].is_representative:
            print(chains[0].to_json())
            sent.append(chains[0].chain.representative_text)
        else:
            sent.append(word['text'])
    sent = [item.strip() for item in sent if item and item.strip()]
    # Rejoin the tokens, attaching punctuation to the preceding token.
    x = ''
    for i in sent:
        if i in ['.', ',', '?', ';', ':']:
            x += i
        else:
            x += ' ' + i
    if sent:
        final.append(x.strip())

print(' '.join(final))

Output: " I am doing this, " I said. I did this.

It should be: " John am doing this, " John said. John did this.

Thank you!

AngledLuffa commented 9 months ago

I could see breaking ties using proper nouns (if available) being a useful modification
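For illustration, here is a minimal sketch of what such a tie-break could look like. It is not Stanza's implementation: the `Mention` class and `pick_representative` function are hypothetical, and it assumes the primary criterion is mention length in words (the thread does not say what is being tied). Among mentions of equal length, one containing a `PROPN` tag wins, and the earliest mention wins any remaining tie.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """Hypothetical stand-in for a coref mention: its text and one UPOS tag per word."""
    text: str
    upos: List[str]


def pick_representative(mentions: List[Mention]) -> Mention:
    """Pick the longest mention (in words); break ties by proper noun, then by first occurrence."""
    def key(indexed):
        index, mention = indexed
        has_propn = "PROPN" in mention.upos
        # Assumed ordering: length first, proper noun second, earliest mention last.
        return (len(mention.upos), has_propn, -index)

    _, best = max(enumerate(mentions), key=key)
    return best


# The three single-word mentions from the example sentence in this issue.
chain = [
    Mention("I", ["PRON"]),
    Mention("John", ["PROPN"]),
    Mention("He", ["PRON"]),
]
print(pick_representative(chain).text)  # John
```

Without the proper-noun component in the key, all three mentions tie on length and the first one ("I") would be chosen, which matches the output reported above.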


AngledLuffa commented 5 months ago

This is now part of the 1.8.2 release