Open hiroshinoji opened 8 years ago
When an error is detected at a higher level in the pipeline (e.g., tokenize), it seems natural that the lower-level annotators (e.g., pos) annotate nothing and just ignore that sentence (or the whole document, if it contains sentences with errors).
Or should the output keep all the <error> tags, one for each annotator? This seems somewhat redundant.
One problem with this approach is that, e.g., <tokens> has children other than <token>.
Here is another proposal:
<sentence id="s0">
<tokens annotators="ssplit tokenize pos" errors="e0">
...
</tokens>
<errors>
<error id="e0" by="pos">...</error>
</errors>
</sentence>
Another merit of this approach is that we can refer to the same error message from different elements, e.g., chunks, dependencies, etc. of knp.
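To make the cross-reference idea concrete, here is a minimal Python sketch of how a consumer could resolve the errors attribute to the <error> nodes it names. It is only an illustration of the proposal above, not code from the project: the function name resolve_errors, the placeholder message text, and the assumption that errors may hold several space-separated ids are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Sentence shaped like the proposal above; the message text is a placeholder.
SENTENCE = """
<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0"/>
  <errors>
    <error id="e0" by="pos">placeholder message</error>
  </errors>
</sentence>
"""

def resolve_errors(sentence):
    """Pair each element carrying an `errors` attribute with its <error> nodes."""
    by_id = {e.get("id"): e for e in sentence.iterfind("errors/error")}
    resolved = []
    for elem in sentence.iter():
        # Assumption: several ids would be separated by whitespace.
        refs = elem.get("errors", "").split()
        if refs:
            resolved.append((elem, [by_id[r] for r in refs if r in by_id]))
    return resolved

resolved = resolve_errors(ET.fromstring(SENTENCE))
for elem, errors in resolved:
    for err in errors:
        print(elem.tag, err.get("by"), err.text)
```

Because every reference goes through the id, several elements could point at the same <error> without duplicating the message.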
This is the final design now accepted in 038c85007174a68e49a188b53469a3876ed01bca.
<sentence id="s0">
<tokens .../>
<error annotator="knp">...</error>
</sentence>
We record neither an error id nor links between <error> and the elements on which the error occurred.
Basically, each annotator is agnostic about the <error> tag; it is SentenceAnnotator or DocumentAnnotator that adds <error> to a problematic sentence or document.
In the current implementation, only an AnnotationError thrown in an annotator is caught and converted to an <error> tag. Should this be changed to catch all errors during annotation?
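The catch-and-convert behavior described above can be sketched roughly as follows. This is a Python illustration rather than the actual Scala implementation: the AnnotationError class here is a stand-in, and annotate_sentences and failing_parser are hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET

class AnnotationError(Exception):
    """Stand-in for the error type an annotator raises on bad input."""

def annotate_sentences(sentences, name, annotate):
    """Run `annotate` on each sentence; on AnnotationError, add an <error> child."""
    for sentence in sentences:
        try:
            annotate(sentence)
        except AnnotationError as exc:
            # The annotator itself knows nothing about <error>; the wrapper adds it.
            error = ET.SubElement(sentence, "error", {"annotator": name})
            error.text = str(exc)
    return sentences

def failing_parser(sentence):
    # Hypothetical annotator that rejects a sentence consisting only of "*".
    if (sentence.text or "").strip() == "*":
        raise AnnotationError("Invalid input")
    ET.SubElement(sentence, "parse")

s0 = ET.Element("sentence", {"id": "s0"})
s0.text = "*"
s1 = ET.Element("sentence", {"id": "s1"})
s1.text = "ok"
annotate_sentences([s0, s1], "knp", failing_parser)
print(ET.tostring(s0, encoding="unicode"))
```

In this sketch, widening `except AnnotationError` to `except Exception` would correspond to catching all errors during annotation, at the cost of also hiding genuine bugs in an annotator.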
This is a concrete example, which occurs when * is given to knp and juman does not convert half-width characters (-juman.normalize false).
<root>
<document id="d0">
<sentences>
<sentence id="s0">
*
<tokens annotators="juman" normalized="false">
<token id="s0_tok0" form="*" characterOffsetBegin="0" characterOffsetEnd="1" yomi="*" lemma="*" pos="未定義語" posId="15" pos1="その他" pos1Id="1" cType="*" cTypeId="0" cForm="*" cFormId="0" misc="NIL"/>
</tokens>
<error annotator="knp">jigg.pipeline.ProcessError: ;; Invalid input <* * * 未定義語 15 その他 1 * 0 * 0 NIL > ! # S-ID:2 KNP:4.12-CF1.1 DATE:2016/03/16 SCORE:0.00000 ERROR:Cannot make mrph EOS</error>
</sentence>
</sentences>
</document>
</root>
The error message of KNP is recorded in the text of <error>.
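Since the accepted design puts <error> directly under the failing sentence, a consumer can find failures with a simple scan. The following Python sketch assumes output shaped like the example above; the function name collect_errors, the second sentence s1, and the placeholder message are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated output in the shape shown above; s1 and the message are placeholders.
OUTPUT = """
<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false"/>
        <error annotator="knp">placeholder message</error>
      </sentence>
      <sentence id="s1">
        <tokens annotators="juman" normalized="false"/>
      </sentence>
    </sentences>
  </document>
</root>
"""

def collect_errors(root):
    """Return (sentence id, annotator, message) for every sentence-level <error>."""
    found = []
    for sentence in root.iter("sentence"):
        for error in sentence.findall("error"):
            message = (error.text or "").strip()
            found.append((sentence.get("id"), error.get("annotator"), message))
    return found

errors = collect_errors(ET.fromstring(OUTPUT))
for sentence_id, annotator, message in errors:
    print(sentence_id, annotator, message)
```

Sentences without an <error> child are simply skipped, so downstream code can treat the result as the list of sentences to exclude or retry.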
TODO: check whether error handling works correctly for CoreNLP. One issue is that all (sub)annotators in CoreNLP are currently DocumentAnnotator, which means that if an error (e.g., a parse error) occurs on one sentence, the analysis of the whole document probably fails. Unexpected behavior may also occur if an annotator of CoreNLP handles some error internally (e.g., when given overly long sentences?).
Here is a proposal for how to keep track of errors in the output XML when errors are detected.
Example:
That is, an error message is surrounded by <error>, which records the annotator that caused the error.
This design can handle the situation where multiple annotators annotate the same XML element and only one of them fails:
The errors attribute on each element may be redundant, but it seems useful for checking errors. I'm not sure.