huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License

add SCONJ to REMOVE_POS to exclude subordinating conjunction from mention span detection #276

Closed noelslice closed 3 years ago

noelslice commented 4 years ago

Using the same example input as in https://github.com/huggingface/neuralcoref/issues/215#issuecomment-568702452, there seems to be a spurious mention "than Shyam", because the subordinating conjunction "than" is not excluded during mention span detection.

This PR adds the SCONJ tag to the REMOVE_POS list.
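
To illustrate why adding a tag to such a list removes the spurious mention, here is a minimal, self-contained sketch of POS-based span trimming. It is not neuralcoref's actual implementation: `trim_span`, the `(text, pos)` token tuples, and every entry in the set other than SCONJ are assumptions made for illustration, and the sketch assumes the tagger labels "than" as SCONJ, as the description above implies.

```python
from typing import List, Tuple

# Illustrative stand-in for the list this PR edits. Only SCONJ comes from
# the PR itself; the other entries are assumed for the example.
REMOVE_POS = {"CCONJ", "INTJ", "ADP", "SCONJ"}

Token = Tuple[str, str]  # (text, coarse POS tag)


def trim_span(tokens: List[Token], start: int, end: int) -> Tuple[int, int]:
    """Shrink the inclusive span [start, end] until neither edge token
    carries a POS tag listed in REMOVE_POS.

    If every token is removed, the returned start is greater than end.
    """
    while start <= end and tokens[start][1] in REMOVE_POS:
        start += 1
    while end >= start and tokens[end][1] in REMOVE_POS:
        end -= 1
    return start, end


sentence = [
    ("Ram", "PROPN"), ("is", "AUX"), ("older", "ADJ"),
    ("than", "SCONJ"), ("Shyam", "PROPN"), (".", "PUNCT"),
]

# Candidate span covering "than Shyam" (token indices 3..4).
start, end = trim_span(sentence, 3, 4)
trimmed = " ".join(text for text, _ in sentence[start:end + 1])
print(trimmed)  # Shyam
```

With SCONJ in the set, the candidate "than Shyam" collapses to "Shyam", which duplicates an existing candidate instead of creating a new one.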

Test case:

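from pprint import pprint

import spacy
import neuralcoref

# Load the large English model and attach neuralcoref to the pipeline.
nlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(nlp, greedyness=0.5)

doc = nlp(u'Ram and Shyam are good boys. Ram is older than Shyam. But, they are not friends.')

# Print the resolved clusters, then the pairwise scores for every mention.
print(doc._.coref_clusters)
pprint(doc._.coref_scores)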

Current output:

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7628642320632935, Ram: -1.576068639755249},
 Shyam: {Ram: -1.5397948026657104,
         Ram and Shyam: -1.5207256078720093,
         Shyam: 1.6105855703353882},
 good boys: {Ram: -1.5992192029953003,
             Ram and Shyam: -1.5002832412719727,
             Shyam: -1.6027263402938843,
             good boys: 1.738552212715149},
 Ram: {Ram: 7.551267623901367,
       Ram and Shyam: -0.8156640529632568,
       Shyam: -1.614872932434082,
       good boys: -1.514532446861267,
       Ram: 1.5904799699783325},
 than Shyam: {Ram: -1.5681246519088745,
              Ram and Shyam: -1.4285391569137573,
              Shyam: -1.5769987106323242,
              good boys: -1.5076005458831787,
              Ram: -1.5768016576766968,
              than Shyam: 1.704783320426941},
 Shyam: {Ram: -1.6349478960037231,
         Ram and Shyam: -1.1569286584854126,
         Shyam: 5.653580665588379,
         good boys: -1.526012897491455,
         Ram: -1.6253626346588135,
         than Shyam: -1.5083305835723877,
         Shyam: 1.242653489112854},
 they: {Ram: -2.0989551544189453,
        Ram and Shyam: -0.7402747869491577,
        Shyam: -2.3023903369903564,
        good boys: -1.5382691621780396,
        Ram: -2.296427011489868,
        than Shyam: -1.0285108089447021,
        Shyam: -2.670758008956909,
        they: 0.07739335298538208},
 friends: {Ram: -1.5777109861373901,
           Ram and Shyam: -1.5296742916107178,
           Shyam: -1.725807785987854,
           good boys: -1.5094072818756104,
           Ram: -1.5740591287612915,
           than Shyam: -1.5106748342514038,
           Shyam: -1.783818006515503,
           they: -1.5725568532943726,
           friends: 2.009723663330078}}

New output ("than Shyam" excluded):

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7629910707473755, Ram: -1.5760746002197266},
 Shyam: {Ram: -1.5397844314575195,
         Ram and Shyam: -1.5207990407943726,
         Shyam: 1.6113454103469849},
 good boys: {Ram: -1.5991358757019043,
             Ram and Shyam: -1.5002236366271973,
             Shyam: -1.602735996246338,
             good boys: 1.7384239435195923},
 Ram: {Ram: 7.543191909790039,
       Ram and Shyam: -0.8214647769927979,
       Shyam: -1.6146637201309204,
       good boys: -1.5146090984344482,
       Ram: 1.5892621278762817},
 Shyam: {Ram: -1.578922986984253,
         Ram and Shyam: -0.6316158771514893,
         Shyam: 7.046931266784668,
         good boys: -1.525830626487732,
         Ram: -1.813422441482544,
         Shyam: 1.1222282648086548},
 they: {Ram: -2.0966665744781494,
        Ram and Shyam: -0.29233384132385254,
        Shyam: -2.266399621963501,
        good boys: -1.5540210008621216,
        Ram: -2.2621068954467773,
        Shyam: -2.6278762817382812,
        they: 0.0765305757522583},
 friends: {Ram: -1.5773955583572388,
           Ram and Shyam: -1.5293686389923096,
           Shyam: -1.721515417098999,
           good boys: -1.5099279880523682,
           Ram: -1.5666728019714355,
           Shyam: -1.809272050857544,
           they: -1.5722771883010864,
           friends: 2.0099644660949707}}

The live demo also doesn't display this mention:

[Image: Screenshot from 2020-07-15 13-53-24]

noelslice commented 4 years ago

Disclaimer: I'm still not convinced that the logic in extract_mentions_spans and _extract_from_sent is robust. I'm still working on my understanding of the code. It would help to add some test cases.
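
As one possible starting point for such test cases, here is a small sketch of a regression check that flags mentions beginning with a function word. The helper name `mentions_starting_with` is hypothetical and not part of neuralcoref, and the mention lists are typed in by hand from the outputs above rather than produced by a live pipeline.

```python
from typing import Iterable, List


def mentions_starting_with(mentions: Iterable[object],
                           words: Iterable[str]) -> List[str]:
    """Return the text of every mention whose first token is in `words`.

    Each mention is converted with str(), so plain strings work, as do
    objects whose str() is their text.
    """
    blocked = {w.lower() for w in words}
    flagged = []
    for mention in mentions:
        text = str(mention)
        parts = text.split()
        if parts and parts[0].lower() in blocked:
            flagged.append(text)
    return flagged


# Mention texts copied from the "current" and "new" outputs above.
before = ["Ram", "Ram and Shyam", "Shyam", "good boys",
          "than Shyam", "they", "friends"]
after = [m for m in before if m != "than Shyam"]

assert mentions_starting_with(before, ["than"]) == ["than Shyam"]
assert mentions_starting_with(after, ["than"]) == []
```

Against a real pipeline, the same check could presumably be fed the keys of `doc._.coref_scores`, since the output above shows mentions as the dictionary keys; that usage is an assumption and is not exercised here.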

svlandeg commented 3 years ago

Thanks for this PR @noelslice! Looks good to me. There are definitely parts of the code base that could use more test cases - all contributions welcome!

noelslice commented 3 years ago

Thanks for having a look and merging this in, @svlandeg!