spraakbanken / sparv-pipeline

Språkbanken's text analysis tool
https://spraakbanken.gu.se/sparv
MIT License

Tree-tagger error for some tokens #107

Closed heatherleaf closed 3 years ago

heatherleaf commented 3 years ago

Sparv throws an exception for some sentences when running tree-tagger:

$ sparv run

🐦 ━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━  67% 0:00:02 treetagger:annotate
Traceback (most recent call last):
  File "/Users/peter/Documents/Genie-projekt/local-data/Jobbannonser/sparv/treetagger-error/.snakemake/scripts/tmpwcvme7nn.run_snake.py", line 91, in <module>
    registry.modules[module_name].functions[f_name]["function"](**parameters)
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/sparv/modules/treetagger/treetagger.py", line 72, in annotate
    tag = tagged_token.strip().split(TAG_SEP)[TAG_COLUMN]
IndexError: list index out of range

Error in rule treetagger::annotate:
    jobid: 5
    output: sparv-workdir/error-text/segment.token/treetagger.upos, sparv-workdir/error-text/segment.token/treetagger.pos, sparv-workdir/error-text/segment.token/treetagger.baseform

RuleException:
CalledProcessError in line 81 of /Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/sparv/core/Snakefile:
Command 'set -euo pipefail;  /Users/peter/.local/pipx/venvs/sparv-pipeline/bin/python /Users/peter/Documents/Genie-projekt/local-data/Jobbannonser/sparv/treetagger-error/.snakemake/scripts/tmpwcvme7nn.run_snake.py' returned non-zero exit status 1.
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2208, in run_wrapper
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/sparv/core/Snakefile", line 81, in __rule__
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 551, in _callback
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 537, in cached_or_run
  File "/Users/peter/.local/pipx/venvs/sparv-pipeline/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2239, in run_wrapper
Exiting because a job execution failed. Look above for error message

The config file is this:

export:
  annotations:
    - TREETAGGER.all

And the corpus is this single sentence:

The primary aim will be to contribute to the goals of the Nordic Initiative for Solar Fuel Development (N-I-S-F-D:) <nisfd.com> .

heatherleaf commented 3 years ago

The problem is that treetagger doesn't return a 3-column row for the token "<nisfd.com>":

$ tree-tagger -token -lemma -no-unknown "/Users/peter/Library/Application Support/sparv/models/treetagger/eng.par" < input.txt 
    reading parameters ...
    tagging ...
The DT  the
primary JJ  primary
aim NN  aim
will    MD  will
be  VB  be
to  TO  to
contribute  VV  contribute
to  TO  to
the DT  the
goals   NNS goal
of  IN  of
the DT  the
Nordic  JJ  Nordic
Initiative  NP  Initiative
for IN  for
Solar   NP  Solar
Fuel    NP  Fuel
Development NP  Development
(N-I-S-F-D:)    NN  (N-I-S-F-D:)
<nisfd.com>
.   SENT    .
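
The failure can be reproduced without TreeTagger. This is a minimal sketch, assuming that TAG_SEP is a tab and TAG_COLUMN is 1; both values are inferred from the word/tag/lemma layout of the output above, not quoted from the Sparv source, and tag_of is a hypothetical helper that wraps the expression from the traceback:

```python
# Assumed values, inferred from the word/tag/lemma output above.
TAG_SEP = "\t"
TAG_COLUMN = 1


def tag_of(tagged_token: str) -> str:
    """Mimic the failing expression from the traceback (hypothetical helper)."""
    return tagged_token.strip().split(TAG_SEP)[TAG_COLUMN]


# A normal three-column row has a tag in column 1.
print(tag_of("goals\tNNS\tgoal"))

# A row that only contains the word form splits into a one-element list,
# so indexing column 1 raises IndexError.
try:
    tag_of("<nisfd.com>")
except IndexError as err:
    print(f"IndexError: {err}")
```

Any output row with fewer than three columns would fail the same way, not only this particular token.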

heatherleaf commented 3 years ago

So the following lines should be fixed: https://github.com/spraakbanken/sparv-pipeline/blob/0fe5f27d0d82548ecc6cb21a69289668aac54cf1/sparv/modules/treetagger/treetagger.py#L70-L74

The code cannot assume that tagged_token.strip().split(TAG_SEP) always has 3 elements.

Suggestion: If column TAG_COLUMN or LEM_COLUMN is missing, use the wordform as a backoff.

anne17 commented 3 years ago

Fixed in 5381691. When TreeTagger can't produce a POS tag, the tag will be empty. When there is no lemma the wordform is used instead. In both cases Sparv will produce a warning.
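
The behaviour described in this comment could look roughly like the sketch below. It is not the code from the commit: parse_row is a hypothetical helper, the column constants are assumed from the three-column TreeTagger output shown earlier, and the standard logging module stands in for whatever logger Sparv uses.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed column layout: word form, POS tag, lemma (tab-separated).
TAG_SEP = "\t"
TAG_COLUMN = 1
LEM_COLUMN = 2


def parse_row(word: str, tagged_token: str) -> tuple[str, str]:
    """Return (tag, lemma) for one output row, tolerating missing columns."""
    cols = tagged_token.strip().split(TAG_SEP)

    if len(cols) > TAG_COLUMN:
        tag = cols[TAG_COLUMN]
    else:
        # No POS tag in the row: leave the tag empty and warn.
        tag = ""
        logger.warning("No POS tag for token %r", word)

    if len(cols) > LEM_COLUMN:
        lemma = cols[LEM_COLUMN]
    else:
        # No lemma in the row: fall back to the word form and warn.
        lemma = word
        logger.warning("No lemma for token %r, using the word form", word)

    return tag, lemma


print(parse_row("goals", "goals\tNNS\tgoal"))
print(parse_row("<nisfd.com>", "<nisfd.com>"))
```

Checking the column count before indexing keeps a single malformed row from aborting the whole annotation run, and the warnings still point to the tokens that need attention.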