Open pmanion0 opened 9 years ago
Description: You should implement a better baseline system, BetterBaseline.java, to familiarize yourself with the task and the code framework. Feel free to experiment with different methods for clustering, which can optionally be guided by statistics you gather from training. These will also be useful in your actual system, either as a basis for your rule-based system, or as insight for features to a classifier.
An example of a baseline system is a head-matching system: at training time the system could accumulate statistics on the heads of the parses of mentions, and remember which heads are coreferent with each other. For example, it would learn that "leader" and "president" are coreferent if leader and president were heads of coreferent mentions at training time.
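A rough sketch of the head statistics described above, assuming each gold cluster is available as a plain list of head-word strings. `HeadCoreferenceStats` and its methods are hypothetical names for illustration; the real framework's mention and entity classes will look different.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: remembers which head words were coreferent in training. */
final class HeadCoreferenceStats {
    // head word -> head words seen in the same gold cluster
    private final Map<String, Set<String>> coreferentHeads = new HashMap<>();

    private static String norm(String head) {
        return head.toLowerCase(Locale.ROOT);
    }

    /** Record one gold cluster, given the head word of each of its mentions. */
    void observeCluster(List<String> headsInCluster) {
        for (String a : headsInCluster) {
            for (String b : headsInCluster) {
                String x = norm(a);
                String y = norm(b);
                if (!x.equals(y)) {
                    coreferentHeads.computeIfAbsent(x, k -> new HashSet<>()).add(y);
                }
            }
        }
    }

    /** True if the heads are identical or were coreferent in some training cluster. */
    boolean seenCoreferent(String headA, String headB) {
        String x = norm(headA);
        String y = norm(headB);
        return x.equals(y)
            || coreferentHeads.getOrDefault(x, Collections.<String>emptySet()).contains(y);
    }
}
```

At test time, two mentions could be merged whenever `seenCoreferent` returns true for their heads. Counting how often each pair co-occurs, instead of storing a plain set, would allow a frequency threshold later.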
You can also consider handling pronouns separately (I, she, etc. likely deserve special attention), modeling distance, or a number of other ideas.
String Identity Matching: This is just a basic string match between a mention's text and the text of the mentions already in a cluster. For example, if President Barack Obama appeared once, we would create a cluster for it, and this rule would automatically add any future occurrences of President Barack Obama to the existing cluster (rather than creating a new singleton).
@jeasenrys This is probably a good warm-up one for Java (and why you can't use string1 == string2 :smile: )
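A minimal sketch of the exact-match rule, assuming mentions are plain strings (the framework's own mention type is not shown here, and `ExactMatchClusterer` is a made-up name). It also shows the `==` pitfall: `==` compares object references, while `equals` (which `HashMap` lookups use, together with `hashCode`) compares contents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical exact-string clusterer; mentions are simplified to Strings. */
final class ExactMatchClusterer {

    /** Groups mentions so that identical strings share one cluster. */
    static List<List<String>> cluster(List<String> mentions) {
        // LinkedHashMap keeps clusters in order of first appearance.
        Map<String, List<String>> byText = new LinkedHashMap<>();
        for (String mention : mentions) {
            // Map lookup relies on equals()/hashCode(), not on ==.
            byText.computeIfAbsent(mention, k -> new ArrayList<>()).add(mention);
        }
        return new ArrayList<>(byText.values());
    }

    public static void main(String[] args) {
        String first = "President Barack Obama";
        String second = new String("President Barack Obama"); // distinct object, same text

        System.out.println(first == second);      // false: different references
        System.out.println(first.equals(second)); // true: same characters

        List<String> mentions = new ArrayList<>();
        mentions.add(first);
        mentions.add("he");
        mentions.add(second);
        System.out.println(cluster(mentions).size()); // 2
    }
}
```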
Deterministic Head-Matching: This is a hard rule: when the majority of the words in two mentions overlap, we consider them the same entity. For example, the text may initially mention the late legendary musician Michael Jackson and then later say Michael Jackson; these probably refer to the same entity. However, we might need some additional rules based on the position of the matching words, the percentage that overlap, or even whether the words are capitalized (seems important?).
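One possible way to score the overlap, sketched under some assumptions: mentions are whitespace-tokenized strings, case is ignored, and overlap is measured as the share of the shorter mention's distinct words that also occur in the longer one. `OverlapMatcher` and the threshold parameter are illustrative, not part of the existing code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical word-overlap rule between two mention strings. */
final class OverlapMatcher {

    /** Lowercased, whitespace-separated distinct tokens of a mention. */
    static Set<String> tokens(String mention) {
        String normalized = mention.trim().toLowerCase(Locale.ROOT);
        return new HashSet<>(Arrays.asList(normalized.split("\\s+")));
    }

    /** Fraction of the shorter mention's tokens that appear in the longer one. */
    static double overlap(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> smaller = ta.size() <= tb.size() ? ta : tb;
        Set<String> larger = ta.size() <= tb.size() ? tb : ta;
        int shared = 0;
        for (String token : smaller) {
            if (larger.contains(token)) {
                shared++;
            }
        }
        return (double) shared / smaller.size();
    }

    /** Hard rule: same entity if the overlap reaches the threshold. */
    static boolean sameEntity(String a, String b, double threshold) {
        return overlap(a, b) >= threshold;
    }
}
```

With this definition, "the late legendary musician Michael Jackson" and "Michael Jackson" overlap fully, while "Michael Jackson" and "Janet Jackson" share only half their words. Measuring against the shorter mention is a design choice that favors short re-mentions of a long first mention; position or capitalization checks would need to be layered on top.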
Lowercase and Partial Match: just committed this. Making case consistent does improve performance compared to the BaselineCoreferenceSystem, but partial match doesn't work as well as exact match (I tuned the parameter a bit, and it gets close; I am thinking of collecting some stats from the training data). Just pushed this to GitHub as a rough starting point (the code may be naive...). Will try some of the other ideas listed above and the ones we discussed.
Fuzzy Head-Matching: Use the training data to find words that act as synonyms and may refer to the same entity, even if those words never appear together in the same mention in the test data.
Result: Training data is way too small to get good estimates here.
Tasks: