korpling / pepperModules-MergingModule

This project provides a Pepper module for the merging of data on several possible levels.

Correct way to merge 3+ sources #8

Closed amir-zeldes closed 8 years ago

amir-zeldes commented 8 years ago

I can't seem to merge 3 source formats, though combinations of 2 seem to work.

If I configure all 3 source importers, then use one merging manipulator, the third source doesn't appear in the result, but there's no error message. If I use the manipulator module twice (importers 1+2, then merge, then importer 3, then merge), I get this error:

CONVERSION ENDED WITH ERRORS, REQUIRED TIME: 00:00:01.297 s
   Error in Pepper module 'Merger, 1.0.2', please contact the module supplier saltnpepper@lists.hu-berlin.de. Some documents are still in the processing queue by module 'Merger' and neither set to 'COMPLETED', 'DELETED' or 'FAILED'. Remaining documents are: [salt:/0/GUM/GUM_interview_ants: IN_PROGRESS] (PepperModuleException)
full stack trace:
org.corpus_tools.pepper.modules.exceptions.PepperModuleException: Error in Pepper module 'Merger, 1.0.2', please contact the module supplier saltnpepper@lists.hu-berlin.de. Some documents are still in the processing queue by module 'Merger' and neither set to 'COMPLETED', 'DELETED' or 'FAILED'. Remaining documents are: [salt:/0/GUM/GUM_interview_ants: IN_PROGRESS]
        at org.corpus_tools.pepper.core.ModuleControllerImpl$2.run(ModuleControllerImpl.java:274)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

What is the correct way to merge from 3 importers?

thomaskrause commented 8 years ago

The correct way should be to define all 3 importers and only use the merging module once, e.g.

<?xml version='1.0' encoding='UTF-8'?>
<pepper-job version="1.0">
    <importer name="CoNLLImporter" path="./conll/">
    </importer>
    <importer name="PTBImporter" path="./ptb/">
    </importer>
    <importer name="RSTImporter" path="./rs3/">
    </importer>
    <manipulator name="Merger">
    </manipulator>
    <exporter name="PAULAExporter" path="./merged/paula/">
    </exporter>
</pepper-job>

Are you sure there were no error messages? Are all files correctly imported when used without merging?

amir-zeldes commented 8 years ago

This took some debugging, but I figured it out: the merger was unable to map the base texts onto each other, because the PTB format uses the strings -LRB-/-RRB- for actual round brackets, which cannot be expressed literally as tokens in that format.

When I put PTB together with other formats in the same merge, the merge would fail and ignore everything after the PTB data (the order of the importers mattered since I'm using firstAsBase). However, there was no error message; warning when texts cannot be merged could be an improvement to the merging module. When I did a pairwise merge instead, the merger complained that there was nothing to merge and the merging module threw an exception. Arguably, even if the merger finds two inputs to merge, it should recognize that there is a problem if a third input could not be merged.

I was able to solve the concrete problem using the escapeMapping property, but maybe modifying the error or warning message behavior is desirable.
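
To illustrate why the base texts failed to align, here is a minimal Python sketch. It is not the Merger module's actual code: the functions `normalize` and `base_texts_match`, the sample tokens, and the mapping dictionary are all hypothetical, and the real escapeMapping property may use a different syntax and normalization procedure. The sketch only shows the general idea that two tokenizations of the same sentence compare as different texts until the PTB bracket escapes are mapped back to literal brackets.

```python
# Illustrative sketch only; NOT the Merger module's implementation.
# All names and sample data below are hypothetical.

def normalize(text, escape_mapping):
    """Apply each replacement in escape_mapping, then drop all whitespace."""
    for source, target in escape_mapping.items():
        text = text.replace(source, target)
    return "".join(text.split())


def base_texts_match(tokens_a, tokens_b, escape_mapping):
    """Compare two token sequences by their normalized base texts."""
    text_a = normalize(" ".join(tokens_a), escape_mapping)
    text_b = normalize(" ".join(tokens_b), escape_mapping)
    return text_a == text_b


# The same sentence as two importers might deliver it (made-up example).
conll_tokens = ["Ants", "(", "Formicidae", ")", "are", "insects"]
ptb_tokens = ["Ants", "-LRB-", "Formicidae", "-RRB-", "are", "insects"]

no_mapping = {}
# Assumed mapping: PTB bracket escapes back to literal brackets.
bracket_mapping = {"-LRB-": "(", "-RRB-": ")"}

print(base_texts_match(conll_tokens, ptb_tokens, no_mapping))
print(base_texts_match(conll_tokens, ptb_tokens, bracket_mapping))
```

Without a mapping the comparison fails, which corresponds to the silent non-merge described above; with the bracket mapping the two base texts become identical.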