Coreference resolution - GitHub issues

anitzkin commented 9 years ago

As most of you know, I have been putting together a "simple" demo of how current NLP + pattern-matcher = the ability to input data, ask, and answer simple questions in natural language. I have been held up for days now trying to work around the fact that the pattern matcher cannot match the question with the answer because the referents have different ids. That is, if I enter "John ate a pizza." and then ask "What did John eat?", there are two different Johns and two different eats in the two setlinks (or evaluationlinks) we are trying to match (and the non-specific versions of them cannot be isolated for the pattern matcher).

We thought we had a solution in the office the other day -- an experimental Scheme routine William wrote for creating a non-specific version of the r2l atoms in the atomspace at a given moment -- and I think something like this could still work, but he or I will have to write a different one, because the output of the one he wrote is very far from suitable for these purposes. Meanwhile, before getting to work on that, it occurred to me that this approach is really a major hack, and perhaps it would be best to also consider what revisions would be required to the pattern matcher and/or the representation of atoms, as this is a really fundamental problem that will need to be solved for real in the long run.

It seems to me that sooner or later the pattern matcher is going to need these functionalities anyway: (1) to be able to match atoms by name while ignoring the ids, and/or (2) to be able to match by only the non-specific atoms in the sub-hypergraph, that is, to filter the specific atoms out of consideration. This raises an issue with the way the atoms are named. Although of course it is possible to write code to take apart the name and its id, it makes one wonder whether they shouldn't be more basically separable, rather than being the concatenated strings they are now.
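To make option (1) concrete, here is a minimal Python sketch of "taking apart the name and its id" and matching on the name alone. It is only an illustration: the `word@id` naming convention, the tuple representation of a relation, the `$`-prefixed variables, and the helper names `split_instance_name` and `match_ignoring_ids` are all assumptions made for this example, not the actual atomspace or pattern matcher API.

```python
from typing import Dict, Optional, Tuple


def split_instance_name(name: str, sep: str = "@") -> Tuple[str, Optional[str]]:
    """Split an instance name such as 'John@222' into ('John', '222').

    Assumes (hypothetically) that the id is appended after the last `sep`.
    Names without a separator are returned unchanged with an id of None.
    """
    word, found, instance_id = name.rpartition(sep)
    if not found:
        return name, None
    return word, instance_id


def match_ignoring_ids(
    pattern: Tuple[str, ...], fact: Tuple[str, ...]
) -> Optional[Dict[str, str]]:
    """Match a pattern against a fact, comparing words and ignoring ids.

    Slots in `pattern` that start with '$' are treated as variables and
    bound to the word found in the same slot of `fact`. Returns the
    bindings on success, or None if any non-variable slot differs.
    """
    if len(pattern) != len(fact):
        return None
    bindings: Dict[str, str] = {}
    for p, f in zip(pattern, fact):
        fact_word = split_instance_name(f)[0]
        if p.startswith("$"):
            bindings[p] = fact_word
        elif split_instance_name(p)[0] != fact_word:
            return None
    return bindings


# "John ate a pizza." and "What did John eat?" with made-up instance ids.
statement = ("eat@111", "John@222", "pizza@333")
question = ("eat@444", "John@555", "$what")
print(match_ignoring_ids(question, statement))
```

Note that matching on names alone would also merge two genuinely different people who both happen to be called John, which is the kind of case that needs more than string surgery.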

Meanwhile, I guess I will see if I can figure out a way to hack around the problem . . .

linas commented 9 years ago

Noooooooooooo ...................

There's already an infrastructure for this; it was written by a GSOC student, under the title "anaphora resolution", although it is (supposed to be) more general than that. The idea of anaphora resolution is to take the sentence pair "John ate a pizza. What did he eat?" and infer that "he" refers to "John". When the student was wrapping up the project, it seemed to actually do that. I'm not quite sure, but I think he also had code to handle "John ate a pizza. What did John eat?" and resolve the two Johns to the same thing. We may not have tested that case.... (yes, there are unit tests for this code, somewhere..)

Anyway, I asked him to design it in a flexible manner: at the most basic level, it does simple checks: string compares for names, gender compares for he/she/it, and also singular/plural compares. That's about all that could be easily done over the summer. The intent of having a framework is to someday replace the simplistic string/gender/number compare with more sophisticated reasoning.
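As a rough illustration of what such string/gender/number checks amount to, here is a small Python sketch. It is not the code in the anaphora module: the `Mention` class, its fields, and the `compatible` and `resolve` functions are hypothetical names invented for this example, and the "most recent compatible candidate wins" rule is an assumed simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    """A hypothetical, simplified mention of an entity in a text."""
    text: str
    gender: str  # "male", "female", "neuter", or "unknown"
    number: str  # "singular" or "plural"
    is_pronoun: bool = False


def compatible(anaphor: Mention, candidate: Mention) -> bool:
    """Apply the three basic filters: number, gender, and name string."""
    # Singular/plural compare.
    if anaphor.number != candidate.number:
        return False
    # Gender compare; an unknown gender on either side does not block a match.
    if "unknown" not in (anaphor.gender, candidate.gender):
        if anaphor.gender != candidate.gender:
            return False
    # String compare for names: a repeated name must match exactly
    # (case-insensitively); pronouns have no name to compare.
    if not anaphor.is_pronoun:
        return anaphor.text.lower() == candidate.text.lower()
    return True


def resolve(anaphor: Mention, previous: List[Mention]) -> Optional[Mention]:
    """Return the most recent non-pronoun mention that passes all filters."""
    for candidate in reversed(previous):
        if candidate.is_pronoun:
            continue
        if compatible(anaphor, candidate):
            return candidate
    return None


# "John ate a pizza. What did he eat?"
john = Mention("John", "male", "singular")
pizza = Mention("pizza", "neuter", "singular")
he = Mention("he", "male", "singular", is_pronoun=True)
print(resolve(he, [john, pizza]))
```

Keeping the filters inside one `compatible` function mirrors the framework idea described above: the simple compares can later be swapped for more sophisticated reasoning without changing the caller.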

So .. don't hack, and don't even think of touching the pattern matcher. There's a better way.

anitzkin commented 9 years ago

Okay . . . and btw, looking at it again, I think William's "abstract-version" may be easier to use for this case than I thought . . . nevertheless, the solution you describe would be more principled, if you want to locate that . . . ?

linas commented 9 years ago

Why, here: https://github.com/opencog/opencog/tree/master/opencog/nlp/anaphora

The primary issue is, of course, that it is not integrated with R2L ... partly because R2L was immature during GSOC, and it also added one more layer that confused the issues. So it currently works straight from relex parses. The author was Hujie Wang, an undergraduate, sophomore or junior year. He did a decent job, given the overall infrastructure he got to deal with.

opencog / opencog

Coreference resolution #1118