Closed EnisBerk closed 6 years ago
Thanks for the feedback. I agree the error message was not very helpful. The question is what you expect to happen with empty lines and what the default behavior should be.
In some applications (e.g. with a sentence-aligned parallel treebank where some translations are missing), we may want to allow empty trees, i.e. trees which have only the technical root but no nodes.
In de8d529, I have fixed udpipe.Base to ignore empty trees, so your example should not fail, but you will get empty trees in the CoNLL-U output (CoNLL-U does not formally allow empty trees, so there is one node with Empty=Yes in MISC).
In f41031de, I added a few more options:
read.Sentences ignore_empty_lines=1 udpipe.Base model_alias=en tokenize=0 write.Conllu ... This will skip the empty lines in the input file, so sentences will be numbered 1,2,3... without any gaps.
read.Sentences tokenize.Simple if_empty_tree=delete udpipe.Base model_alias=en tokenize=0 write.Conllu ... This will delete the empty trees, so there will be gaps in the sent_id numbers.
read.Sentences tokenize.Simple udpipe.Base model_alias=en tokenize=0 write.Conllu if_empty_tree=skip ... The empty trees stay in the document the whole time; only the writer block ignores them (skips them when printing). This has the same effect as the previous example (again there will be gaps in the sent_id numbers), except that tokenize.Simple and udpipe.Base will be applied to the empty trees (but thanks to de8d529 it will ignore them).
Let me know if you have any other suggestions.
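To make the numbering difference between these scenarios concrete, here is a small standalone Python sketch. It does not call Udapi at all: the function name, the mode labels and the rule that whitespace-only lines count as empty are assumptions made for illustration only, not the library's actual implementation.

```python
def number_sentences(lines, mode):
    """Return (sent_id, text) pairs for the sentences that would be printed.

    `mode` is a hypothetical label used only in this sketch:
      "ignore_lines": drop empty lines BEFORE numbering
                      (mimics read.Sentences ignore_empty_lines=1)
      "drop_trees":   number every line first, then drop the empty ones
                      (mimics if_empty_tree=delete and if_empty_tree=skip)
    """
    # Assumption: a whitespace-only line is treated as empty.
    if mode == "ignore_lines":
        kept = [line for line in lines if line.strip()]
        return list(enumerate(kept, start=1))
    if mode == "drop_trees":
        numbered = enumerate(lines, start=1)
        return [(i, line) for i, line in numbered if line.strip()]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    lines = ["First sentence.", "", "Third line."]
    print(number_sentences(lines, "ignore_lines"))
    # [(1, 'First sentence.'), (2, 'Third line.')]
    print(number_sentences(lines, "drop_trees"))
    # [(1, 'First sentence.'), (3, 'Third line.')]
```

In the first mode the ids stay contiguous, so they no longer match line numbers in the input file; in the second mode the gaps preserve that alignment, which is what a sentence-aligned parallel treebank would need.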
Thanks for the quick response. In my case I would prefer ignoring empty lines, but you are right about applications that might need to keep track of all lines, so it is helpful to provide the ignore_empty_lines flag. You might also consider logging a message when empty trees occur, or citing this thread in the documentation for anyone who cannot figure out why there are empty trees.
> log about occurrence of empty tree
You can use if_empty_tree=skip_warn. Now I see I have forgotten to add delete_warn, which may be useful in some cases.
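If you would rather find out about empty lines before running any scenario, a pre-check along these lines might help. This is only a sketch using the standard library: the helper names are made up for illustration, it is not part of Udapi, and it assumes that whitespace-only lines should be reported as empty.

```python
import logging


def find_empty_lines(lines):
    """Return the 1-based positions of empty (or whitespace-only) lines."""
    return [i for i, line in enumerate(lines, start=1) if not line.strip()]


def warn_about_empty_lines(lines, source="<input>"):
    """Log one warning listing the empty lines, and return their positions."""
    positions = find_empty_lines(lines)
    if positions:
        logging.warning(
            "%s: %d empty line(s) at line(s) %s",
            source, len(positions), positions,
        )
    return positions
```

For a file such as 00001.txt, one could call `warn_about_empty_lines(open("00001.txt", encoding="utf-8").read().splitlines(), "00001.txt")` before passing the file to the scenario.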
> or cite this thread on the documentation
I would appreciate any help with improving the documentation (in any form: docstrings, readthedocs, the tutorial,...) via PRs. Unfortunately, I am too busy for the next few months to fix it myself.
The library does not handle empty lines in input files and does not provide a meaningful error. Input: (00001.txt is a file with an empty line in it.)
Output: