gregdurrett / berkeley-entity

The Berkeley Entity Resolution System jointly solves the problems of named entity recognition, coreference resolution, and entity linking with a feature-rich discriminative model.
GNU General Public License v3.0

odd wikification behavior #2

Closed · stackflow2 closed this 9 years ago

stackflow2 commented 9 years ago

Hi, thanks for making this code available! I'm trying it on some fake text and getting unexpected results in the wikification. I have the following meaningless blah.txt (just playing around with different types of entities):

Michael Jackson was born in the United Kingdom and his dog was born in Japan.  He became president of Microsoft in March 2016.  Jackson owns a golf course in the UK and loves to listen to Freaky Girl.

I made a WikipediaInterface that includes a bunch of entities, including most of those in blah.txt.

Running the Driver produces the following output-wiki.conll:

#begin document (test2/text/blah.txt); part 000
(Michael Jackson*
*)
*
*
*
(United Kingdom*
*
*)
*
(Dog -LRB-zodiac-RRB-(-EXCLUDE-*)
*)
*
*
*
(Japan*)
*

(-EXCLUDE-*)
*
(President of the United States*
*
(-NIL-*))
*
(-NIL-*
*)
*

(Lauren Jackson*)
*
(-NIL-*
*
*
*
(-NIL-*
*))
*
*
*
*
*
(-NIL-*
*)
*

#end document

Questions: 1) Why would it guess "Lauren Jackson" for the last "Jackson"? The coreference system knows that these mentions share the same reference id, so I could feasibly resolve it myself that way. But I'm also wondering why it would pick Lauren Jackson at all, given my WikipediaInterface -- here's what queryDisambigs is giving:

ArrayBuffer([Jackson, Mississippi : 1,269, Jackson, Michigan : 357, Edwin Jackson : 346, Jackson, Tennessee : 315, Lauren Jackson : 269, Jackson County, Missouri : 227, Jackson County, West Virginia : 146, ...])

2) Similarly, not sure how "Dog (zodiac)" skipped over "Dog", given queryDisambigs:

ArrayBuffer([], [Dog : 927, Dog (zodiac) : 173, Hurricane Dog (1950) : 7, Dog (film) : 4, Dog (album) : 4, Police dog : 3, Dog meat : 3, Dog (single) : 2, Dog (band) : 2], [])

3) "President of the United States" is wrong, and it misses "the UK" ...

4) Do you have code that gives the single most likely Wikipedia entity for all references with a particular id? e.g. "Michael Jackson", "his", "He", and "Jackson" are all resolved by coreference, but with different wikiChunks (Michael Jackson vs. Lauren Jackson). I would think that, since you're doing coref & NER jointly, you'd have that functionality, but I haven't found it.

I've been trying to debug, but it gets a bit opaque once I get into the BP nodes.

gregdurrett commented 9 years ago

Hi,

Thanks for your interest!

The system that runs on raw data is pretty bad at Wikification because Wikification is treated as a latent variable during learning; that model isn't trained with any knowledge of gold-standard Wikipedia labels. The ACE model is much better, but it requires ACE-style mentions to be fed in during preprocessing, which is an unfortunate byproduct of how the system was evaluated given the datasets that are available. We're currently working on a standalone Wikification component that should do well more broadly, but that won't be available for at least a few months.

To answer your questions:

1) The data that the system is trained on often has coreferent mentions with different Wikipedia links (e.g. Barack_Obama and President_of_the_United_States), so agreement isn't enforced as a hard constraint. I agree this is a bad mistake, but it's not easily fixable.

2) There are features that look at things like parentheticals in the Wikipedia titles; in this case that leads to a silly error, but the system does sometimes have to skip over the most obvious choice to get the correct answer, which is presumably why it has learned to do this.

3) President is so often used to refer to the President of the US that I'm not surprised this error happens. Using contextual information to resolve this appropriately is a current research problem.

4) No code for this, sorry. You could write a separate module that reads the documents back in after prediction (using ConllDocReader) and either does a kind of voting or prefers the label given to the first mention.
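As a rough sketch, that voting could look something like the following; the Mention case class and the way mentions are read back in are placeholders, not the actual ConllDocReader data structures, so treat this as an outline rather than working integration code:

// Majority-vote Wikification over coreference clusters.
// Mention is a placeholder data model: populate it from whatever
// the prediction output / ConllDocReader gives you.
case class Mention(clusterId: Int, wikiTitle: Option[String])

object WikiVoter {
  // For each cluster, pick the most frequent non-NIL, non-EXCLUDE title
  // among its mentions (ties broken arbitrarily); None if no mention
  // in the cluster received a real link.
  def voteTitles(mentions: Seq[Mention]): Map[Int, Option[String]] = {
    mentions.groupBy(_.clusterId).map { case (clusterId, ms) =>
      val titles = ms.flatMap(_.wikiTitle).filterNot(t => t == "-NIL-" || t == "-EXCLUDE-")
      val best =
        if (titles.isEmpty) None
        else Some(titles.groupBy(identity).maxBy(_._2.size)._1)
      clusterId -> best
    }
  }
}

You would then overwrite every mention's wikiChunk with the winning title for its cluster, which gives the "one Wikipedia entity per coreference id" behavior asked about in question 4.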

Unfortunately, most of these issues are nontrivial to fix and require a richer model rather than simple debugging. But thank you for bringing them to my attention, and I appreciate the interest!

Greg

stackflow2 commented 9 years ago

Thanks @gregdurrett. Yes, it sounds like I'll have to work around this for the time being -- but I still really appreciate that you put out this code.

Just for my understanding though, let me follow up on question 2. I understand that the system sometimes needs to skip the most obvious thing. But at a high level, what are the signals that drive the skipping? i.e., what signal might make it skip over "Dog" and go to "Dog (zodiac)", when Dog has a stronger prior in the wikiDB and is presumably a better textual match?

(Similarly but less surprisingly, I'm wondering what signals might make it skip to Lauren Jackson, over "Jackson, Mississippi" or "Edwin Jackson" in the wikiDB).

gregdurrett commented 9 years ago

1) The features for Wikification are described in edu.berkeley.nlp.entity.wiki.QueryChooser (not the best name). It's possible that many links to articles without parentheticals (e.g. "Dog") are bad, and the system has features on this, so I'm guessing it has somehow learned to prefer "Title (X)" over "Title" even when "Title" is ranked more highly. I'm not sure, though. Also, the prior is only used to rank the choices, so even an overwhelming preference for the first choice over the second won't necessarily translate into the first choice winning (not necessarily the best decision on my part); a toy illustration of this ranking effect follows these answers.

2) Not sure why it picked Lauren Jackson instead of Edwin Jackson, but skipping Jackson, Mississippi might actually be the system working as intended: it knows that Jackson is coreferent with Michael Jackson, who is of type person, so features targeting agreement between NER and the entity link should prefer a link to an article about a person.
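To make the ranking point in 1) concrete, here is a toy scoring example; the features and weights are invented for illustration and are not the actual QueryChooser feature set:

// Toy linear scorer showing how a rank-based prior plus a
// parenthetical-title feature can prefer "Dog (zodiac)" over "Dog"
// even though "Dog" has a much larger raw count (927 vs. 173).
// Features and weights here are hypothetical.
object RankToy {
  val counts = Map("Dog" -> 927, "Dog (zodiac)" -> 173)
  val ranked = counts.toSeq.sortBy(-_._2).map(_._1)

  // Hypothetical learned weights over two simple features.
  val weights = Map(
    "rank=0" -> 0.5, "rank=1" -> 0.0,
    "hasParen=true" -> 1.0, "hasParen=false" -> 0.0
  )

  def score(title: String): Double = {
    // The count only determines the rank; its magnitude is thrown away.
    val feats = Seq("rank=" + ranked.indexOf(title),
                    "hasParen=" + title.contains("("))
    feats.map(f => weights.getOrElse(f, 0.0)).sum
  }

  def main(args: Array[String]): Unit = {
    ranked.foreach(t => println(t + " -> " + score(t)))
    // Dog -> 0.5
    // Dog (zodiac) -> 1.0
  }
}

The point is just that once the 927-vs.-173 gap is collapsed to "rank 0 vs. rank 1", a moderately weighted parenthetical (or any other) feature is enough to flip the decision.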