Open GoogleCodeExporter opened 9 years ago
I reproduced this problem.
For the MASC sentence corpus, the MascReader returns 1865 empty CASes and 13754
normal CASes.
I copied a version of the MascReader into a local project and ran the following
pipeline:
String patterns = "round*/*-v/*-wn.xml";
SimplePipeline.runPipeline(
    createReaderDescription(
        MascReader.class,
        MascReader.PARAM_IGNORE_TIES, true,
        MascReader.PARAM_SOURCE_LOCATION, MASCDirectory,
        MascReader.PARAM_PATTERNS, new String[] {
            ResourceCollectionReaderBase.INCLUDE_PREFIX + patterns }),
    createEngineDescription(LanguageToolSegmenter.class),
    createEngineDescription(MascProblemFinder.class)
    //createEngineDescription(CasDumpWriter.class)
);
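A check of the kind that could produce the counts above can be illustrated without UIMA. The following is only a minimal sketch, assuming each CAS is reduced to its document text (a plain String that is null or blank when the CAS is empty). The class EmptyTextCounter and its methods are hypothetical names and are not part of the MascReader, MascProblemFinder, or any library used in the pipeline.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain strings stand in for the document text of each CAS.
public class EmptyTextCounter {

    // Assumption: a CAS counts as "empty" when its text is null or only whitespace.
    public static boolean isEmpty(String documentText) {
        return documentText == null || documentText.trim().isEmpty();
    }

    // Counts how many of the given document texts are empty.
    public static long countEmpty(List<String> documentTexts) {
        return documentTexts.stream().filter(EmptyTextCounter::isEmpty).count();
    }

    public static void main(String[] args) {
        // Arrays.asList permits null elements, which model CASes without text.
        List<String> texts = Arrays.asList("A sentence.", null, "", "Another sentence.");
        long empty = countEmpty(texts);
        System.out.println("empty: " + empty + ", normal: " + (texts.size() - empty));
    }
}
```

In a real pipeline the same test would sit inside an annotator and inspect the document text of each incoming CAS instead of a list of strings.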
I modified the MascReader to return a placeholder sentence instead of an empty
CAS. This is the place in the reader where the problem is introduced:
// if no tie between annotators is discovered
if (documentText != null) {
    setDocumentMetadata(jCas, node);
    jCas.setDocumentText(documentText);
}
else {
    setDocumentMetadata(jCas, node);
    jCas.setDocumentText("This is an empty Cas.");
    //jCas.reset(); // TODO here the CAS is emptied
}
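The effect of this change can be looked at in isolation. Below is a minimal sketch, assuming that a null documentText is what marks a tie between annotators (as the condition above suggests). The class PlaceholderText and the method resolve are hypothetical names introduced for illustration; a plain String stands in for the CAS.

```java
// Hypothetical stand-in for the reader's branch: no UIMA types are involved.
public class PlaceholderText {

    // Placeholder sentence taken from the modified reader code above.
    public static final String PLACEHOLDER = "This is an empty Cas.";

    // Returns the real text when present, otherwise the placeholder,
    // so that the result is never null.
    public static String resolve(String documentText) {
        return documentText != null ? documentText : PLACEHOLDER;
    }

    public static void main(String[] args) {
        System.out.println(resolve("A real sentence."));
        System.out.println(resolve(null));
    }
}
```

With this fallback every item yields some text, so downstream components such as a segmenter never receive a CAS without a document text, at the cost of placeholder sentences appearing in the output.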
Original comment by eckle.kohler
on 8 Dec 2014 at 8:45
I don't recall much about the MASC corpus format, so I don't have much context
to help me interpret this problem report. I take it from reading the code that
the empty CAS was returned only in those cases where there was a tie between
the annotators. Is this perhaps the intended behaviour? If not, is your
modified code above intended to fix the problem?
Original comment by tristan.miller@nothingisreal.com
on 11 Dec 2014 at 2:48
Original issue reported on code.google.com by
MedKhema...@gmail.com
on 3 Dec 2014 at 6:54