saad120 / dkpro-wsd

Automatically exported from code.google.com/p/dkpro-wsd

MASCReader returns empty CASes #65

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
MASCReader generates an empty CAS for some files in the corpus.

Original issue reported on code.google.com by MedKhema...@gmail.com on 3 Dec 2014 at 6:54

GoogleCodeExporter commented 9 years ago
I reproduced this problem.

For the MASC sentence corpus, the MASCReader returns 1865 empty CASes and 13754 
normal CASes.

I copied a version of the MASCReader into a local project and ran the following 
pipeline:

        String patterns = "round*/*-v/*-wn.xml";
        SimplePipeline.runPipeline(
                createReaderDescription(
                        MascReader.class,
                        MascReader.PARAM_IGNORE_TIES, true,
                        MascReader.PARAM_SOURCE_LOCATION, MASCDirectory,
                        MascReader.PARAM_PATTERNS, new String[] {
                                ResourceCollectionReaderBase.INCLUDE_PREFIX + patterns }),
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(MascProblemFinder.class)
                //createEngineDescription(CasDumpWriter.class)
                );
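
The source of MascProblemFinder is not included in this comment. As a rough illustration of the kind of tally that could produce the empty/normal counts above, here is a minimal sketch in plain Java. It deliberately avoids the UIMA API so that it runs standalone; the class EmptyTextCounter and its methods are hypothetical and not part of DKPro WSD, and treating whitespace-only text as empty is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DKPro WSD: tallies empty versus
// non-empty document texts, the way a diagnostic consumer could tally CASes.
final class EmptyTextCounter {
    private int empty;
    private int nonEmpty;

    // Record one document text; null or whitespace-only text counts as empty
    // (an assumption made for this sketch).
    void record(String documentText) {
        if (documentText == null || documentText.trim().isEmpty()) {
            empty++;
        } else {
            nonEmpty++;
        }
    }

    int getEmpty() {
        return empty;
    }

    int getNonEmpty() {
        return nonEmpty;
    }

    public static void main(String[] args) {
        EmptyTextCounter counter = new EmptyTextCounter();
        // Made-up sample texts; null and "" stand in for empty CASes.
        List<String> texts = Arrays.asList(
                "He rounded the corner.", null, "", "Round up the total.");
        for (String text : texts) {
            counter.record(text);
        }
        System.out.println("empty=" + counter.getEmpty()
                + " normal=" + counter.getNonEmpty());
    }
}
```

In a real pipeline the same check would presumably sit inside an analysis engine and read the text from each incoming CAS, reporting the totals once processing completes.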

I then modified the MASCReader to return a placeholder sentence instead of an 
empty CAS. The else branch below is where the empty CAS was introduced:

        // normal case: no tie between annotators was discovered
        if (documentText != null) {
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText(documentText);
        }
        else {
            // no document text: emit a placeholder sentence instead of an empty CAS
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText("This is an empty Cas.");

            // original behaviour, now commented out: the CAS was emptied here
            //jCas.reset();
        }

Original comment by eckle.kohler on 8 Dec 2014 at 8:45

GoogleCodeExporter commented 9 years ago
I don't recall much about the MASC corpus format, so I don't have much context 
to help me interpret this problem report.  I take it from reading the code that 
the empty CAS was returned only in those cases where there was a tie between 
the annotators.  Is this perhaps the intended behaviour?  If not, is your 
modified code above intended to fix the problem?

Original comment by tristan.miller@nothingisreal.com on 11 Dec 2014 at 2:48