mynlp / jigg

Pipeline framework for easy natural language processing
Apache License 2.0

Resolving annotation conflicts #27

Open hiroshinoji opened 8 years ago

hiroshinoji commented 8 years ago

Currently, if we apply two annotators that annotate the same element, both annotations are added to the result. Stanford CoreNLP instead overwrites the old annotation. Following this, I implemented a method that checks whether the same element already exists when adding XML elements. Such duplicates occur, e.g., when running a joint POS-and-tree parser after applying a POS tagger.

I plan to push this modification, but I am also wondering whether overwriting is the best way to resolve conflicts. It may be better to also output a warning, but this may be future work.
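A minimal sketch of the check-then-overwrite idea described above, written in Python rather than Jigg's Scala. The helper name `add_or_replace`, the `source` attribute, and the choice to emit a warning on replacement are all assumptions made for illustration; none of this is Jigg's actual API.

```python
import warnings
import xml.etree.ElementTree as ET


def add_or_replace(parent, new_child):
    """Add new_child to parent, replacing existing children with the same tag.

    Hypothetical helper, not part of Jigg. Returns True if anything was replaced.
    """
    # Collect existing children of the same element type (e.g. <tokens>).
    duplicates = [child for child in parent if child.tag == new_child.tag]
    for old in duplicates:
        parent.remove(old)
    if duplicates:
        # Make the conflict visible instead of overwriting silently.
        warnings.warn(f"replacing existing <{new_child.tag}> annotation")
    parent.append(new_child)
    return bool(duplicates)


# A sentence already annotated by a POS tagger, then by a joint parser.
sentence = ET.fromstring('<sentence><tokens source="tagger"/></sentence>')
replaced = add_or_replace(sentence, ET.fromstring('<tokens source="parser"/>'))
```

After the call, `sentence` holds a single `<tokens>` child, the one from the second annotator, and `replaced` reports that a conflict was resolved.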

hiroshinoji commented 8 years ago

Now CabochaAnnotator replaces the old annotations (chunks and dependencies) if they exist. https://github.com/mynlp/jigg/commit/323a3b0c8e20802e6fcd640a7f1070e1f35dfff5#diff-9b2b4b9eb3146599a3ce60c12afa4ddeR46

hiroshinoji commented 8 years ago

Another option:

Anyway, each annotation should have an attribute recording the used annotator, e.g.:

<tokens annotators="juman">...</tokens>
<tokens annotators="knp">...</tokens>

hiroshinoji commented 8 years ago

I've changed cabocha's behavior in d17b7511251c21be2f9be0de812227d4375a6b97 to keep the old annotation, because the annotator name (cabocha) is now recorded on every element.

It may be better to support an option that decides whether to keep or replace the old annotation, as in -knp.replaceJumanTokens.

Generally, keeping annotations of the same type from different annotators seems to make the lower-level processing a bit complicated, so it might be better for the default behavior to replace the old annotation.
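The keep-or-replace option could look roughly like the sketch below, again in Python rather than Jigg's Scala. The function `add_annotation` and its `replace_old` flag are hypothetical, loosely modelled on the -knp.replaceJumanTokens option mentioned above; only the `annotators` attribute comes from the thread. Replacing is the default here, as suggested in the last comment.

```python
import xml.etree.ElementTree as ET


def add_annotation(parent, new_child, annotator, replace_old=True):
    """Attach new_child to parent, recording which annotator produced it.

    Hypothetical helper, not part of Jigg. With replace_old=True, existing
    children of the same element type are removed first; otherwise they stay.
    """
    new_child.set("annotators", annotator)
    if replace_old:
        # findall returns a list, so removing while looping is safe.
        for old in parent.findall(new_child.tag):
            parent.remove(old)
    parent.append(new_child)


def annotators_of(parent, tag):
    """List the annotator names recorded on parent's children with this tag."""
    return [child.get("annotators") for child in parent.findall(tag)]


# Keep both annotations: downstream code must pick one by annotator name.
kept_doc = ET.fromstring('<sentence><tokens annotators="juman"/></sentence>')
add_annotation(kept_doc, ET.Element("tokens"), "knp", replace_old=False)

# Replace (the default): only the newest annotation survives.
replaced_doc = ET.fromstring('<sentence><tokens annotators="juman"/></sentence>')
add_annotation(replaced_doc, ET.Element("tokens"), "knp")
```

With `replace_old=False`, every later consumer has to filter by the `annotators` attribute, which is the extra complexity the comment above refers to.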