Closed heatherleaf closed 3 years ago
The problem is that TreeTagger doesn't return a 3-column row for every token. In the output below, the token <nisfd.com> comes back as a single column:
$ tree-tagger -token -lemma -no-unknown "/Users/peter/Library/Application Support/sparv/models/treetagger/eng.par" < input.txt
reading parameters ...
tagging ...
The DT the
primary JJ primary
aim NN aim
will MD will
be VB be
to TO to
contribute VV contribute
to TO to
the DT the
goals NNS goal
of IN of
the DT the
Nordic JJ Nordic
Initiative NP Initiative
for IN for
Solar NP Solar
Fuel NP Fuel
Development NP Development
(N-I-S-F-D:) NN (N-I-S-F-D:)
<nisfd.com>
. SENT .
So the following should be fixed:
https://github.com/spraakbanken/sparv-pipeline/blob/0fe5f27d0d82548ecc6cb21a69289668aac54cf1/sparv/modules/treetagger/treetagger.py#L70-L74
It cannot assume that `tagged_token.strip().split(TAG_SEP)` always has 3 elements.
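To illustrate the failure, here is a minimal sketch using three rows from the output above. It assumes `TAG_SEP` is a tab character. `naive_parse` is a hypothetical stand-in for code that unpacks three columns, not a copy of the linked lines.

```python
TAG_SEP = "\t"  # assumption: TreeTagger output columns are tab-separated

# Three rows taken from the output above; the middle one has only one column.
rows = ["Development\tNP\tDevelopment", "<nisfd.com>", ".\tSENT\t."]

# Count the columns that each row yields after splitting.
column_counts = [len(row.strip().split(TAG_SEP)) for row in rows]


def naive_parse(row):
    """Hypothetical parser that assumes exactly three columns per row."""
    word, tag, lemma = row.strip().split(TAG_SEP)
    return word, tag, lemma


for row in rows:
    try:
        print(naive_parse(row))
    except ValueError as err:
        # Unpacking fails when the row has fewer than three columns.
        print(f"failed on {row!r}: {err}")
```

The one-column row raises a `ValueError` on unpacking, which is the kind of exception described in this issue.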
Suggestion: If column `TAG_COLUMN` or `LEM_COLUMN` is missing, use the wordform as a backoff.
Fixed in 5381691. When TreeTagger can't produce a POS tag, the tag will be empty. When there is no lemma the wordform is used instead. In both cases Sparv will produce a warning.
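The behaviour described above could look roughly like the sketch below. This is not the code from the commit: the function name `parse_row`, the logger name, the warning texts, and the values of `TAG_SEP`, `TAG_COLUMN`, and `LEM_COLUMN` are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("treetagger_sketch")  # hypothetical logger name

TAG_SEP = "\t"   # assumption: columns are tab-separated
TAG_COLUMN = 1   # assumption: POS tag is the second column
LEM_COLUMN = 2   # assumption: lemma is the third column


def parse_row(row):
    """Return (word, tag, lemma), tolerating rows with missing columns."""
    cols = row.strip().split(TAG_SEP)
    word = cols[0]

    if len(cols) > TAG_COLUMN:
        tag = cols[TAG_COLUMN]
    else:
        # No POS tag was produced: leave the tag empty and warn.
        tag = ""
        logger.warning("No POS tag for token %r", word)

    if len(cols) > LEM_COLUMN:
        lemma = cols[LEM_COLUMN]
    else:
        # No lemma was produced: fall back to the wordform and warn.
        lemma = word
        logger.warning("No lemma for token %r, using the wordform", word)

    return word, tag, lemma
```

With this sketch, `parse_row("<nisfd.com>")` returns the token with an empty tag and the wordform as its lemma, while complete rows such as `"goals\tNNS\tgoal"` pass through unchanged.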
Sparv throws an exception for some sentences when running tree-tagger:
The config file is this:
And the corpus is this single sentence: