udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Handling empty lines #49

Closed EnisBerk closed 6 years ago

EnisBerk commented 6 years ago

The library does not handle empty lines in input files and does not provide a meaningful error message. Input (00001.txt is a file with an empty line in it):

cat ./example/00001.txt | udapy \
 read.Sentences \
 tokenize.Simple \
 udpipe.Base model_alias=en tokenize=0 \
 write.Conllu > 00001.conllu

Output:

2018-06-12 08:47:34,336 [   INFO] execute -  ---- ROUND ----
2018-06-12 08:47:34,336 [   INFO] execute - Executing block Sentences
2018-06-12 08:47:34,337 [   INFO] execute - Executing block Simple
2018-06-12 08:47:34,349 [   INFO] execute - Executing block Base
Traceback (most recent call last):
  File "/Users/berk/Documents/my_projects/NL/NLenv/bin/udapy", line 84, in <module>
    runner.execute()
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/core/run.py", line 159, in execute
    block.apply_on_document(document)
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/core/block.py", line 36, in apply_on_document
    self.process_document(document)
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/core/block.py", line 44, in process_document
    self.process_bundle(bundle)
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/core/block.py", line 32, in process_bundle
    self.process_tree(tree)
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/block/udpipe/base.py", line 102, in process_tree
    return self.tool.tag_parse_tree(root)
  File "/Users/berk/Documents/my_projects/NL/NLenv/lib/python3.6/site-packages/udapi/tool/udpipe.py", line 33, in tag_parse_tree
    for parsed_node in parsed_root.descendants:
AttributeError: 'NoneType' object has no attribute 'descendants'

martinpopel commented 6 years ago

Thanks for the feedback. I agree the error message was not very helpful. The question is what you expect to happen with empty lines and what the default behavior should be.

In some applications (e.g. with a sentence-aligned parallel treebank where some translations are missing), we may want to allow empty trees, i.e. trees which have only the technical root but no nodes.

In de8d529, I have fixed udpipe.Base to ignore empty trees, so your example should not fail, but you will get empty trees in the CoNLL-U output (CoNLL-U does not formally allow empty trees, so there is one node with Empty=Yes in MISC).
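
The traceback above comes from iterating over the descendants of a parser result that turned out to be None. As a rough illustration of the kind of guard described here, the sketch below uses hypothetical stand-ins (`Node`, `fake_parse`, and this simplified `tag_parse_tree`); it is not udapi's actual code and does not reproduce the real commit.

```python
class Node:
    """Hypothetical stand-in for a tree node; not udapi's Node class."""

    def __init__(self, form=None):
        self.form = form
        self.children = []

    @property
    def descendants(self):
        """Return all nodes below this one, depth-first."""
        result = []
        for child in self.children:
            result.append(child)
            result.extend(child.descendants)
        return result


def fake_parse(forms):
    """Stand-in for an external parser that returns None for empty input."""
    if not forms:
        return None
    root = Node()
    root.children = [Node(form) for form in forms]
    return root


def tag_parse_tree(root):
    """Parse the words under root, leaving empty trees untouched."""
    forms = [node.form for node in root.descendants]
    if not forms:
        # Guard: an empty tree has only the technical root, so there is
        # nothing to parse and no parser result to iterate over.
        return []
    parsed_root = fake_parse(forms)
    return [node.form for node in parsed_root.descendants]
```

Without the guard, an empty tree would reach `parsed_root.descendants` with `parsed_root` set to None, which is the same AttributeError reported in this issue.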

In f41031de, I added a few more options:

Let me know if you have any other suggestions.

EnisBerk commented 6 years ago

Thanks for the quick response. In my case I would prefer ignoring empty lines, but you are right about applications that might need to keep track of all lines, so it is helpful to provide an ignore_empty_lines flag. You might also consider logging the occurrence of empty trees, or citing this thread in the documentation for anyone who cannot figure out why there are empty trees.
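
For readers who simply want empty lines gone before the text reaches the pipeline, one option independent of any library flag is to filter the input first. The sketch below uses only the Python standard library; `drop_empty_lines` is a hypothetical helper name and not part of udapi.

```python
def drop_empty_lines(lines):
    """Yield only the lines that contain non-whitespace characters."""
    for line in lines:
        if line.strip():
            yield line


# Example: a text with one empty line and one whitespace-only line.
text = "First sentence.\n\nSecond sentence.\n   \nThird sentence.\n"
cleaned = "".join(drop_empty_lines(text.splitlines(keepends=True)))
```

Note that this changes line numbering, so it is unsuitable for the sentence-aligned use case mentioned earlier in the thread, where every input line must keep its position.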

martinpopel commented 6 years ago

> log about occurrence of empty tree

You can use if_empty_tree=skip_warn. Now I see I have forgotten to add delete_warn, which may be useful in some cases.
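
To make the idea of skipping with a warning concrete, here is a minimal sketch of how such a policy could behave. It is an assumption-laden illustration rather than udapi's implementation: trees are simplified to plain lists of word forms, `process_trees` is a hypothetical function, and of the policy values used only `skip_warn` is named in this thread (`process` and `skip` are assumed for contrast).

```python
import logging

logger = logging.getLogger("empty_tree_sketch")

# Only "skip_warn" appears in the thread; the other values are assumptions.
ALLOWED_POLICIES = ("process", "skip", "skip_warn")


def process_trees(trees, process, if_empty_tree="process"):
    """Apply process() to each tree, honouring a policy for empty trees.

    Each tree is a plain list of word forms (a hypothetical simplification).
    """
    if if_empty_tree not in ALLOWED_POLICIES:
        raise ValueError("unknown if_empty_tree value: %r" % (if_empty_tree,))
    results = []
    for index, tree in enumerate(trees, start=1):
        if not tree and if_empty_tree != "process":
            if if_empty_tree == "skip_warn":
                # Report which tree was skipped so that gaps in the
                # output can be traced back to the input.
                logger.warning("Tree %d is empty, skipping it", index)
            continue
        results.append(process(tree))
    return results
```

With `skip_warn`, each empty tree is left out of the results and one warning per empty tree is logged, which covers the logging request made above.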

> or cite this thread on the documentation

I would appreciate any help with improving the documentation (in any form: docstrings, readthedocs, the tutorial,...) via PRs. Unfortunately, I am too busy for the next few months to fix it myself.