amir-zeldes closed this issue 8 years ago
The correct way should be to define all 3 importers and use the merging module only once, e.g.:
<?xml version='1.0' encoding='UTF-8'?>
<pepper-job version="1.0">
  <importer name="CoNLLImporter" path="./conll/">
  </importer>
  <importer name="PTBImporter" path="./ptb/">
  </importer>
  <importer name="RSTImporter" path="./rs3/">
  </importer>
  <manipulator name="Merger">
  </manipulator>
  <exporter name="PAULAExporter" path="./merged/paula/">
  </exporter>
</pepper-job>
Are you sure there were no error messages? Are all files correctly imported when used without merging?
This took some debugging, but I figured it out: the problem was an inability to map the base texts, because the PTB format uses the strings -LRB- and -RRB- for actual round brackets, which cannot be expressed as tokens in the format.
When I put PTB together with other formats in the same merge, the merge would fail and ignore everything after the PTB data (the order of the importers mattered since I'm using firstAsBase). However, there was no error message (this could maybe be an improvement in the merging module: warn if merging texts is impossible). If I did a pairwise merge, the merger complained that there was nothing to merge and the merging module threw an exception. Arguably, even if the merger finds two inputs to merge, it should realize that there is a problem if a third input could not be merged.
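To make the suggested warning behavior concrete, here is a rough sketch in Python. It is not the Merger's actual code: the function names, the input texts, and the whole-string comparison are assumptions made up for illustration, and the real module's alignment logic is certainly more involved. The idea is only that every input is checked against the base, and any input that cannot be aligned is reported instead of being dropped silently.

```python
def normalize(text):
    """Remove all whitespace so that only the character sequence is compared."""
    return "".join(text.split())


def find_unmergeable(texts):
    """texts is a list of (name, text) pairs; the first pair is treated as the base.

    Returns the names of all inputs whose normalized text differs from the base.
    (Hypothetical helper: a simplification of whatever the real merger does.)
    """
    base_norm = normalize(texts[0][1])
    return [name for name, text in texts[1:] if normalize(text) != base_norm]


# Made-up example inputs mirroring the three importers in this issue.
inputs = [
    ("conll", "He left ( early ) ."),
    ("ptb", "He left -LRB- early -RRB- ."),
    ("rst", "He left (early)."),
]

skipped = find_unmergeable(inputs)
for name in skipped:
    print(f"WARNING: could not align '{name}' with base '{inputs[0][0]}'")
```

With a check like this, the third input is still compared against the base even though the second one failed, so each unmergeable source gets its own warning.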
I was able to solve the concrete problem using the escapeMapping property, but maybe modifying the error or warning message behavior is desirable.
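For anyone hitting the same thing, a small Python sketch of why the base texts did not match and why an escape mapping fixes it. The dictionary below is only an illustration of the idea and is not the actual syntax or default value of the escapeMapping property; the token lists are invented as well.

```python
# Hypothetical mapping for illustration only; NOT the Merger's real escapeMapping syntax.
PTB_ESCAPES = {"-LRB-": "(", "-RRB-": ")"}


def base_text(tokens, escapes=None):
    """Join tokens into one whitespace-free string, replacing escaped tokens first."""
    escapes = escapes or {}
    return "".join(escapes.get(tok, tok) for tok in tokens)


# The same sentence as it might be tokenized in the two formats (made-up example).
conll_tokens = ["He", "left", "(", "early", ")", "."]
ptb_tokens = ["He", "left", "-LRB-", "early", "-RRB-", "."]

# Without the mapping, the PTB text contains "-LRB-" where the other has "(".
raw_match = base_text(conll_tokens) == base_text(ptb_tokens)

# With the mapping applied to the PTB side, both texts are identical.
fixed_match = base_text(conll_tokens) == base_text(ptb_tokens, PTB_ESCAPES)

print(raw_match, fixed_match)  # prints: False True
```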
I can't seem to merge 3 source formats, though combinations of 2 seem to work.
If I configure all 3 source importers, then use one merging manipulator, the third source doesn't appear in the result, but there's no error message. If I use the manipulator module twice (importers 1+2, then merge, then importer 3, then merge), I get this error:
What is the correct way to merge from 3 importers?