mynlp / jigg

Pipeline framework for easy natural language processing
Apache License 2.0

Consistent error handling #32

Open hiroshinoji opened 8 years ago

hiroshinoji commented 8 years ago

Here is a proposal for how to keep track of errors in the output XML when an annotator fails.

Example:

<chunks annotators="cabocha" errors="cabocha">
<error by="cabocha">error message</error>
</chunks>

That is, an error message is wrapped in <error>, whose by attribute records the annotator that caused the error.

This design can handle the situation where multiple annotators annotate the same XML element and only one of them fails:

<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
...
<error by="pos">error message</error>
</tokens>

The errors attribute on each element may be redundant, but it seems useful for checking quickly whether an error occurred. I'm not sure.
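To illustrate why the errors attribute could be handy, here is a minimal sketch in Python (standard library only) of how a consumer might check an element under this proposal. The helper names failed_annotators and error_messages are made up for illustration, and the sketch assumes errors holds a space-separated list like annotators does.

```python
import xml.etree.ElementTree as ET

# Sample output following the proposal above; the message text is a placeholder.
xml_text = """
<tokens annotators="ssplit tokenize pos" errors="pos">
  <token id="0" offsetBegin="0" offsetEnd="1">I</token>
  <error by="pos">error message</error>
</tokens>
"""

def failed_annotators(elem):
    """Return the annotators named in the element's errors attribute.

    Assumes a space-separated list, mirroring the annotators attribute.
    """
    return elem.get("errors", "").split()

def error_messages(elem):
    """Map each failing annotator to the text of its <error> child."""
    return {e.get("by"): (e.text or "") for e in elem.findall("error")}

tokens = ET.fromstring(xml_text)
print(failed_annotators(tokens))
print(error_messages(tokens))
```

With the attribute, a consumer can tell that something failed without scanning the children; without it, the same information is still recoverable from the <error> children, which is why it may be redundant.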

hiroshinoji commented 8 years ago

When an error is detected at a higher level in the pipeline (e.g., tokenize), it seems natural that the lower-level annotators (e.g., pos) annotate nothing and just ignore that sentence (or the document, if it contains sentences with errors).

Or should the output keep an <error> tag for every annotator? This seems somewhat redundant.

hiroshinoji commented 8 years ago

One problem with this approach is that, e.g., <tokens> ends up with children other than <token>. Here is another proposal:

<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0">
  ...
  </tokens>
  <errors>
    <error id="e0" by="pos">...</error>
  </errors>
</sentence>

Another merit of this approach is that we can refer to the same error message from different elements, e.g., the chunks, dependencies, etc. produced by knp.
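A rough Python sketch of how the id references in this second proposal could be resolved. The sample XML, its messages, and the function name resolve_errors are made up for illustration, and the sketch assumes the errors attribute holds space-separated error ids.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two knp elements share one error entry (e1).
xml_text = """
<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0"/>
  <chunks annotators="knp" errors="e1"/>
  <dependencies annotators="knp" errors="e1"/>
  <errors>
    <error id="e0" by="pos">pos failed</error>
    <error id="e1" by="knp">knp failed</error>
  </errors>
</sentence>
"""

def resolve_errors(sentence):
    """Map each child tag to the (annotator, message) pairs it refers to."""
    # Index the shared <error> entries by id.
    by_id = {e.get("id"): e for e in sentence.findall("errors/error")}
    resolved = {}
    for child in sentence:
        ids = child.get("errors", "").split()
        if ids:
            resolved[child.tag] = [(by_id[i].get("by"), by_id[i].text) for i in ids]
    return resolved

sentence = ET.fromstring(xml_text)
print(resolve_errors(sentence))
```

The cost of this layout is the extra lookup step: every consumer has to build the id index before it can show a message.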

hiroshinoji commented 8 years ago

This is the final design, now accepted in 038c85007174a68e49a188b53469a3876ed01bca.

<sentence id="s0">
  <tokens .../>
  <error annotator="knp">...</error>
</sentence>

We record neither an error id nor links between <error> and the elements on which the error occurred.

Basically, individual annotators do not need to know about the <error> tag; it is SentenceAnnotator or DocumentAnnotator that adds <error> to a problematic sentence or document.

In the current implementation, only an AnnotationError thrown in an annotator is caught and converted to an <error> tag. Maybe this should be changed to catch all errors during annotation?
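The control flow described here can be sketched as follows. This is a Python analogue for illustration only, not jigg's Scala code: the AnnotationError class, annotate_sentence, and failing_knp below are hypothetical stand-ins, and how jigg treats partial output from a failing annotator is not covered.

```python
import xml.etree.ElementTree as ET

class AnnotationError(Exception):
    """Hypothetical Python stand-in for jigg's AnnotationError."""

def annotate_sentence(sentence, name, annotate):
    """Run one annotator on a sentence element.

    Only AnnotationError is converted to <error annotator=name>;
    any other exception propagates to the caller.
    """
    try:
        annotate(sentence)
    except AnnotationError as e:
        err = ET.SubElement(sentence, "error", {"annotator": name})
        err.text = str(e)
    return sentence

def failing_knp(sentence):
    # Simulates an annotator that rejects its input.
    raise AnnotationError("Cannot make mrph")

sentence = ET.fromstring('<sentence id="s0"><tokens annotators="juman"/></sentence>')
annotate_sentence(sentence, "knp", failing_knp)
print(ET.tostring(sentence, encoding="unicode"))
```

Catching all errors would amount to widening the except clause, at the risk of hiding genuine bugs behind an <error> tag.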

Here is a concrete example, which occurs when * is given to knp and juman does not convert half-width characters (-juman.normalize false).

<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false">
          <token id="s0_tok0" form="*" characterOffsetBegin="0" characterOffsetEnd="1" yomi="*" lemma="*" pos="未定義語" posId="15" pos1="その他" pos1Id="1" cType="*" cTypeId="0" cForm="*" cFormId="0" misc="NIL"/>
        </tokens>
        <error annotator="knp">jigg.pipeline.ProcessError: ;; Invalid input &lt;* * * 未定義語 15 その他 1 * 0 * 0 NIL &gt; ! # S-ID:2 KNP:4.12-CF1.1 DATE:2016/03/16 SCORE:0.00000 ERROR:Cannot make mrph EOS</error>
      </sentence>
    </sentences>
  </document>
</root>

The error message from KNP is recorded as the text of <error>.
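On the consumer side, output in this final format could be scanned for errors roughly as follows. This is a Python sketch using only the standard library; the function name collect_errors is made up, and the sample is a cut-down version of the example with a placeholder in place of the full KNP message.

```python
import xml.etree.ElementTree as ET

# Cut-down sample in the final format; the error text is a placeholder.
xml_text = """
<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false"/>
        <error annotator="knp">abbreviated KNP message</error>
      </sentence>
      <sentence id="s1">
        ok
        <tokens annotators="juman"/>
      </sentence>
    </sentences>
  </document>
</root>
"""

def collect_errors(root):
    """List (parent tag, parent id, annotator, message) for every <error>.

    <error> can sit under a sentence or a document, so every element is
    checked for direct <error> children.
    """
    found = []
    for parent in root.iter():
        for err in parent.findall("error"):
            found.append((parent.tag, parent.get("id"), err.get("annotator"), err.text))
    return found

root = ET.fromstring(xml_text)
for item in collect_errors(root):
    print(item)
```

Since there are no ids or links in this design, the parent of <error> is the only indication of which sentence or document failed.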

hiroshinoji commented 8 years ago

TODO: check whether error handling works correctly for CoreNLP. One issue is that all (sub)annotators in CoreNLP are currently DocumentAnnotators, which means that if an error (e.g., a parse error) occurs in one sentence, the analysis of the whole document probably fails. Or might unexpected behavior occur if some CoreNLP annotator handles an error internally (e.g., when given overly long sentences)?