tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Checking input to pipeline components #43

Closed: marcverhagen closed this issue 3 years ago

marcverhagen commented 7 years ago

Say you invoke Tarsqi as follows:

$ python tarsqi.py pipeline=PREPROCESSOR,TOKENIZER,TAGGER,CHUNKER <infile> <outfile>

You now get a funky duplication of lex tags, where one tag has just the token information and the other also has the pos attribute:

<lex id="l17" begin="161" end="166" lemma="woman" origin="PREPROCESSOR" pos="NN" text="WOMAN" />
<lex id="l17" begin="161" end="166" origin="TOKENIZER" text="WOMAN" />

What happened is that the tokenizer, tagger and chunker ran once because of PREPROCESSOR, and then ran again because of TOKENIZER,TAGGER,CHUNKER. Downstream components, starting with the chunker, will break on the lex tags that are missing the pos attribute (which is half of them).

Solution: use either PREPROCESSOR or TOKENIZER,TAGGER,CHUNKER, but not both.
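
For example, either of the following invocations (with the same input and output arguments as above) should produce a single set of lex tags:

$ python tarsqi.py pipeline=PREPROCESSOR <infile> <outfile>
$ python tarsqi.py pipeline=TOKENIZER,TAGGER,CHUNKER <infile> <outfile>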

Also, this should be made clear in the documentation.

And perhaps Tarsqi should be a bit smarter and check the input it is asked to run on.
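
As a rough illustration, a check along the following lines could reject pipeline specifications that duplicate the preprocessing work. This is only a sketch; the function and variable names are made up and nothing like this exists in the current code base.

# Hypothetical sanity check on the pipeline specification; not part of the
# current Tarsqi code. PREPROCESSOR already runs the tokenizer, tagger and
# chunker, so combining it with any of them duplicates the lex tags.

PREPROCESSOR = 'PREPROCESSOR'
PREPROCESSOR_SUBTASKS = {'TOKENIZER', 'TAGGER', 'CHUNKER'}

def check_pipeline(pipeline):
    """Raise an error if the pipeline duplicates preprocessing work.

    `pipeline` is the list of component names taken from the pipeline=...
    option, for example ['PREPROCESSOR', 'TOKENIZER', 'TAGGER', 'CHUNKER'].
    """
    components = set(pipeline)
    overlap = components & PREPROCESSOR_SUBTASKS
    if PREPROCESSOR in components and overlap:
        raise ValueError(
            "PREPROCESSOR already includes %s; use either PREPROCESSOR "
            "or TOKENIZER,TAGGER,CHUNKER, but not both"
            % ', '.join(sorted(overlap)))

# The combination from the example above would then be rejected up front
# instead of producing duplicated lex tags:
# check_pipeline(['PREPROCESSOR', 'TOKENIZER', 'TAGGER', 'CHUNKER'])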