spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Coreference Resolution #565

Open oaguy1 opened 5 years ago

oaguy1 commented 5 years ago

Hello! I am looking into using coreference resolution in a project I am working on. There exists a reasonably easy (read: does not require a neural network and training) algorithm to do just this, and I was thinking of adding it to this library. I read the contributing guide and wanted to open an issue to test the waters before spending a lot of time working on this.

oaguy1 commented 5 years ago

A simpler explanation and sample implementation of the algorithm I mentioned can be found here.

spencermountain commented 5 years ago

YESSSSS go for it! any ideas about how you'd like to handle the api for it? I'd be happy to help.

something like this?

let doc=nlp('Carrots are orange. They are delicious.')
doc.pronouns().data()
// [{text:'they', normal:'they', reference:'carrots'}]

doc.nouns().data()
//[{text:'carrots', normal:'carrots', references:['they']}]

something like that? there is a term-id property, (i think) that you could use, too. anyways, yeah. sounds great. go for it.
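A rough sketch of how the two response shapes above could be derived from one list of resolved links. It does not call compromise at all: the `links` input and the `buildCoreferenceData` name are hypothetical, and only the output fields (`text`, `normal`, `reference`, `references`) come from the example above.

```javascript
// Hypothetical input: one entry per pronoun that has already been resolved
// to a noun. Nothing here calls compromise; it only shows the two shapes.
function buildCoreferenceData(links) {
  // pronoun -> noun direction
  const pronouns = links.map(({ pronoun, noun }) => ({
    text: pronoun,
    normal: pronoun.toLowerCase(),
    reference: noun.toLowerCase(),
  }))

  // noun -> pronouns direction: invert the links, grouping by noun
  const byNoun = new Map()
  for (const { pronoun, noun } of links) {
    const key = noun.toLowerCase()
    if (!byNoun.has(key)) {
      byNoun.set(key, { text: noun, normal: key, references: [] })
    }
    byNoun.get(key).references.push(pronoun.toLowerCase())
  }
  return { pronouns, nouns: [...byNoun.values()] }
}

const result = buildCoreferenceData([{ pronoun: 'They', noun: 'Carrots' }])
console.log(result.pronouns)
console.log(result.nouns)
```

Keeping both directions in one pass means the pronoun view and the noun view cannot drift apart.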

spencermountain commented 5 years ago

it may be desirable too, to actually fetch the reference word(s), so that people can do whatever they want to the results, like replace them or something.

The only tricky-part i can imagine is tracking-down the reference word(s), and packing them into a Text object, so that a person can do doc.match('#Vegetable').nouns().references().match('#whatever').toUpperCase()... and so on. This could get a little complicated. I'm happy to help

oaguy1 commented 5 years ago

Glad you are excited! The algorithm I linked to can track down the references of the pronouns in a manner that is right most of the time (80%).

The way I was thinking about approaching this was to add an additional tagging step where we look at each of the pronouns and then use Hobbs' algorithm to find the best guess at the antecedent. With that in mind, my initial plan for the API was something like this:

// grabbing the antecedent of a pronoun
doc.match('#Pronoun').get(0).antecedent();

// grabbing the pronouns for a person
doc.people().get(0).pronouns();

I think once we have the additional API built out for Terms, we could add something closer to what you initially suggested at the more macro/document level.

Let me know what you think! I plan on sitting down and putting some more time on this tomorrow.

spencermountain commented 5 years ago

yeah cool! to make it feel like the other methods, i'd do it like this

doc.match('#Pronoun').antecedents(0);
doc.people().pronouns(0);

either way, happy to see this in-action, then we can shove it around after.

been thinking past few weeks about breaking-up compromise into more micro-libraries, like d3 did. If we end up doing that, this work will end-up in a named-entity-plugin, or something like that (just a heads-up) thanks, lemme know if I can help with anything.

oaguy1 commented 5 years ago

Sounds good! I definitely want to try to keep the API as close to the rest of the library as possible. I am hacking on this when I have time, but still won't have much to share for a while. Once I have a good working MVP with tests I will make a PR and we can really play with it.

oaguy1 commented 5 years ago

@spencermountain As part of the algorithm I am implementing, I am trying to start from an individual Term (an instance of a pronoun) and then move to the previous Term in the sentence to see if it matches some criteria. Once the beginning of the sentence is reached, the "previous" term would be the last Term in the previous sentence. It would also be good to know if the previous term came from another sentence or paragraph. Is there support for such movement within the text currently in the lib? If not, where would be a good place to start for adding it?
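The backwards walk described here can be sketched over plain arrays. This is only an illustration under assumptions: the document is modelled as an array of sentences, each an array of term strings, and `previousTerms` is a made-up helper, not something compromise provides.

```javascript
// Assumption: the document is a plain array of sentences, each an array of
// term strings. This is an illustration, not compromise's internal model.
function* previousTerms(sentences, sentenceIndex, termIndex) {
  let s = sentenceIndex
  let t = termIndex - 1
  let crossed = 0 // how many sentence boundaries we have stepped over
  while (s >= 0) {
    if (t < 0) {
      // Reached the start of this sentence: jump to the end of the one before.
      s -= 1
      if (s < 0) return
      t = sentences[s].length - 1
      crossed += 1
      continue
    }
    yield { term: sentences[s][t], sentence: s, index: t, crossed }
    t -= 1
  }
}

const sentences = [
  ['Carrots', 'are', 'orange'],
  ['They', 'are', 'delicious'],
]
for (const step of previousTerms(sentences, 1, 2)) {
  console.log(step.term, step.crossed)
}
```

The `crossed` counter answers the "did this come from another sentence" question; paragraphs would need an extra level of nesting.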

spencermountain commented 5 years ago

hey David, yeah you may want to just use the internal arrays of sentences, and terms.

let doc = nlp(myText)
doc.list //arrays of sentences
doc.list[0].terms // terms in each sentence

we don't have any support for paragraphs (right now)

oaguy1 commented 5 years ago

That is helpful, thank you so much for responding so quickly.

I was thinking, if you only have sentences of terms, how do you feel about adding some sort of index to the Term objects, so they are aware of their position within the document? I could add this during the build process, an attribute named something like refPosition with a two item length array [index of sentence, index of term].
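As a minimal sketch of that idea, assuming terms are plain objects: `refPosition` is the attribute name floated above, while `addRefPositions` and `comparePositions` are hypothetical helpers and not part of compromise.

```javascript
// Assumption: terms are plain objects; `refPosition` is the attribute name
// proposed in the comment above, not an existing compromise property.
function addRefPositions(sentences) {
  return sentences.map((terms, sentenceIndex) =>
    terms.map((term, termIndex) => ({
      ...term,
      refPosition: [sentenceIndex, termIndex],
    }))
  )
}

// Compare two positions in document order: negative if a comes before b.
function comparePositions(a, b) {
  return a[0] - b[0] || a[1] - b[1]
}
```

Positions stored this way are only valid until a term is inserted or removed, at which point they have to be recomputed.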

Let me know what you think, I don't want to be too crazy adding things w/o checking in.

spencermountain commented 5 years ago

hey David, yeah this has been the hard-part of making compromise, that 'position within the document' changes considerably, and depends on where the user is zooming-in, cloning, etc.

I've started working on a major re-write, for v12, that you may be interested in, over here: https://github.com/spencermountain/compromise/tree/linked-list. It uses a linked-list model, so references, and indexes are more 'postmodern', and don't suffer any of the awkwardness you're going through.
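To illustrate why a linked list sidesteps stored indexes, here is a generic doubly linked list of terms. It is a sketch of the general idea only, not the data model used in the linked-list branch; `linkTerms`, `insertAfter` and `textBefore` are hypothetical names.

```javascript
// A generic doubly linked list of terms; an illustration of the idea only,
// not the data model in the linked-list branch.
function linkTerms(words) {
  const terms = words.map((text) => ({ text, prev: null, next: null }))
  terms.forEach((term, i) => {
    term.prev = terms[i - 1] || null
    term.next = terms[i + 1] || null
  })
  return terms
}

// Insert a new term after an existing one by rewiring the neighbours.
function insertAfter(term, text) {
  const added = { text, prev: term, next: term.next }
  if (term.next) term.next.prev = added
  term.next = added
  return added
}

// Walk backwards from a term and collect the text of everything before it.
function textBefore(term) {
  const out = []
  for (let t = term.prev; t; t = t.prev) out.unshift(t.text)
  return out
}
```

Because each term only knows its neighbours, an insertion never invalidates a reference held elsewhere, which is the property a numeric index cannot offer.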

I'm also concerned that adding in co-reference resolution to v11 may be more complicated than it would be in v12. It's not very solid yet, and still moving-around in some circles..

How would you feel about me creating a compromise-coreference repo, and us working on it there?

That would give us an opportunity to implement that Hobbs paper, without worrying about api changes:

const nlp=require('compromise')
const ccr=require('compromise-coreference')

let doc=nlp(myText)
let json = ccr(doc)
/* {whatever json-structure you'd like} */

how's that?

oaguy1 commented 5 years ago

That would be a great stop-gap between APIs and a good place to get the algorithm down before (possibly) adding it to the full lib. If it would be easier to implement the paper in the new API, I am happy to lend a hand to speed that along, too.

Thank you for all the help so far! It is nice to have a positive contribution experience on such a cool project.

spencermountain commented 5 years ago

hey, i've added you to a basic version of this here: https://github.com/nlp-compromise/compromise-coreference. take it for a ride - feel free to commit directly to it, it's pretty-rough! cheers

oaguy1 commented 5 years ago

Awesome, thanks!

oaguy1 commented 5 years ago

Also, have you considered a tree structure over a linked list? That seems to be the data structure that many nlp libs use and makes adding additional depths (paragraphs, etc) and exact document position more doable.
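A small sketch of what such a tree could look like, with document, paragraph, sentence and term levels. Everything here is an assumption for illustration: the node shape, the naive regex splitting, and the `buildTree` and `termsWithPath` names. It is not how compromise or any particular nlp library tokenizes text.

```javascript
// Hypothetical tree: document -> paragraphs -> sentences -> terms.
// Splitting is deliberately naive (blank lines, then sentence punctuation).
function buildTree(text) {
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim())
  return {
    type: 'document',
    children: paragraphs.map((p) => ({
      type: 'paragraph',
      children: p.trim().split(/(?<=[.!?])\s+/).map((s) => ({
        type: 'sentence',
        children: s.split(/\s+/).map((w) => ({ type: 'term', text: w })),
      })),
    })),
  }
}

// Depth-first walk that reports each term with its path from the root:
// [paragraph index, sentence index, term index].
function* termsWithPath(node, path = []) {
  if (node.type === 'term') {
    yield { text: node.text, path }
    return
  }
  for (let i = 0; i < node.children.length; i++) {
    yield* termsWithPath(node.children[i], [...path, i])
  }
}
```

The path gives an exact document position, and adding another depth (sections, say) only means adding another level of nesting.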

spencermountain commented 5 years ago

i'd love to hear more about this idea, how do you imagine it working?

oaguy1 commented 5 years ago

Would love to discuss. I created a slack channel so we can go back and forth without polluting this issue.

https://compromisenlp.slack.com/

spencermountain commented 5 years ago

wanna just join the existing slack group? https://slackin-kyzvclgjlg.now.sh/

oaguy1 commented 5 years ago

Yes! I will delete the group I created (should have searched first)

au-re commented 1 year ago

hi! sorry to comment on an old issue, but I was wondering if coreference resolution eventually did become part of compromise?

spencermountain commented 1 year ago

hey Aurélien - it's on my new-year's resolutions this year.

There's actually an undocumented api for it here - i wouldn't recommend using it yet though.

will update this issue when it lands. Would love some help. cheers

spencermountain commented 1 year ago

if you, (or anybody) was interested in working on it, the current implementation is here

it's a pretty-tricky problem. current version looks back 2 sentences for a 'he' or 'she'. i think i started to try 'they' and got overwhelmed. 'it' is most-likely the hardest. it should also chain, so 'he' looks for previous 'he' references, etc. cheers
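The look-back heuristic described above (search the current and previous two sentences for a gendered match, and let a pronoun inherit the referent of an earlier pronoun) could be sketched roughly as follows. This is not the code in compromise: the term shape `{ text, person }`, the `PRONOUN_GENDER` table and both function names are assumptions made for illustration.

```javascript
// Assumptions: each term is { text, person?: 'male' | 'female' }, where the
// `person` tag has already been guessed elsewhere. All names are illustrative.
const PRONOUN_GENDER = { he: 'male', him: 'male', she: 'female', her: 'female' }

function findAntecedent(sentences, s, i, gender, maxLookback, resolved) {
  // Scan right-to-left, from the current sentence back to s - maxLookback.
  for (let ps = s; ps >= Math.max(0, s - maxLookback); ps--) {
    const start = ps === s ? i - 1 : sentences[ps].length - 1
    for (let pi = start; pi >= 0; pi--) {
      const prev = sentences[ps][pi]
      if (prev.person === gender) return prev.text
      // Chain: an earlier pronoun of the same gender passes on its referent.
      const earlier = resolved.find((r) => r.sentence === ps && r.index === pi)
      const sameGender = PRONOUN_GENDER[prev.text.toLowerCase()] === gender
      if (earlier && earlier.refersTo && sameGender) return earlier.refersTo
    }
  }
  return null
}

function resolvePronouns(sentences, maxLookback = 2) {
  const resolved = [] // { pronoun, sentence, index, refersTo }
  sentences.forEach((terms, s) => {
    terms.forEach((term, i) => {
      const gender = PRONOUN_GENDER[term.text.toLowerCase()]
      if (!gender) return
      const refersTo = findAntecedent(sentences, s, i, gender, maxLookback, resolved)
      resolved.push({ pronoun: term.text, sentence: s, index: i, refersTo })
    })
  })
  return resolved
}
```

Chaining is what lets a 'he' four sentences after the name still resolve, as long as another 'he' sits inside the window. 'they' and 'it' are left out on purpose, since they need number and animacy cues this sketch does not model.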

au-re commented 1 year ago

I've been reading a bit about the topic, turns out co-reference resolution is a whole field of research :sweat_smile: I found a paper describing a nice rule based algorithm that might be a good starting point https://aclanthology.org/J13-4004.pdf

It describes a series of sieves that are applied until all mentions in a text refer to some entity.

Maybe it could work something like this:

const text = "John is a musician. He played a new song. A girl was listening to the song. 'It is my favorite', John said to her."

nlp(text).coreference().json()
[
    { terms: [...], text: "John", coreference: { refs: [1]  } }, 
    { terms: [...], text: "he", coreference: { refs: [1]  } }, 
    { terms: [...], text: "a new song", coreference: { refs: [2]  } }, 
    { terms: [...], text: "It", coreference: { refs: [2]  } }, 
    { terms: [...], text: "A girl", coreference:{ refs: [3]  } }, 
    { terms: [...], text: "the song", coreference: { refs: [2]  } }, 
    { terms: [...], text: "my", coreference: { refs: [1] } }, 
    { terms: [...], text: "her", coreference: { refs: [3]  } }, 
]

Keeping an array of references might be useful for cases where one word might refer to several entities (e.g. "they")

Here are some of the sieves described in the paper:

  1. Mention Detection
  2. Speaker Identification
  3. Exact Match
  4. Pronominal Coreference Resolution (I think this is what you have started working on)

For each mention we then try to find a matching antecedent by running it through every sieve, a sieve either resolves the match or leaves it for a later sieve.
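The per-mention loop described above might look something like this. The two sieves are simplified stand-ins rather than the sieves from the paper, and the mention shape (`text`, `isPronoun`, `gender`, `entity`) and all function names are assumptions for the sake of the sketch.

```javascript
// Each sieve takes (mention, earlierMentions) and returns an antecedent or null.
// These two sieves are simplified stand-ins, not the sieves from the paper.
function exactMatchSieve(mention, earlier) {
  if (mention.isPronoun) return null
  const text = mention.text.toLowerCase()
  return earlier.find((m) => !m.isPronoun && m.text.toLowerCase() === text) || null
}

function nearestGenderSieve(mention, earlier) {
  if (!mention.isPronoun) return null
  for (let i = earlier.length - 1; i >= 0; i--) {
    const candidate = earlier[i]
    if (!candidate.isPronoun && candidate.gender === mention.gender) return candidate
  }
  return null
}

// Run sieves in order; the first one that answers wins, the rest are skipped.
// Unresolved mentions start a new entity with a fresh id.
function runSieves(mentions, sieves) {
  let nextId = 1
  mentions.forEach((mention, i) => {
    const earlier = mentions.slice(0, i)
    let antecedent = null
    for (const sieve of sieves) {
      antecedent = sieve(mention, earlier)
      if (antecedent) break
    }
    mention.entity = antecedent ? antecedent.entity : nextId++
  })
  return mentions
}
```

Ordering the sieves from most to least precise keeps a confident match from being overridden by a looser rule later on.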

Some additional methods might be useful to build the sieves:

nlp(text).mentions().json()
// [{ terms: [...], text: "John" }, { terms: [...], text: "It" }, { terms: [...], text: "A girl" }, { terms: [...], text: "my" }, ...]

nlp(text).speakers().json()
// [{ terms: [...], text: "John", speaker: { quote: "It is my favorite" } }]

spencermountain commented 1 year ago

hey Aurélien, thank you for sharing this. I'll read that paper this week, it looks really helpful. It would be great to work on this problem with someone.

I've got a few changes on the dev branch in advance of doing coreference. I can talk through them if you'd like, but it should land as a release next week. Mostly changes to .nouns() responses, for weird noun-phrases. There's also an awkwardly named people().guessGender() 😬.

I'm also trying to build-up a tag for people referred to not by name, called #Actor - for things like 'the bartender ... he ..', or 'my grandma ... she'. Right now it's just a bunch of professions, mostly.

i like the sketchup for the api. Let me read that paper and release these fixes then I'll ping you next week. cheers

spencermountain commented 1 year ago

okay, #Actor stuff is released in 14.8.2. Ready to start reproducing this paper, if you wanted. The api right now is this:

doc.pronouns().forEach(p=>{
  p.refersTo().debug()
})

The logic lives here and the half-passing tests are here

Lots to do! You're welcome to try something in a branch, or make a pr to dev or something. cheers