Feature: Allow additional annotations for input sentences

Tooa commented 8 years ago

Hi :wave:,

this is a feature request for an additional input format that also tackles the output format. I often have additional annotations like a document id or a sentence id for given sentences and I want to preserve the annotations in the parses.

An input could look like this:

documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document

For the output, I propose a one-sentence-per-line format, since this is easy to post-process.

0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(-5,-5)

I already tried to re-add the annotations after the parsing myself. However, this solution doesn't work since ParZu computes in a multi-threaded manner and doesn't preserve the input order in its output.

To overcome this issue, one could introduce some sort of buffer that stores finished sentences after parsing. In order to preserve the input order the buffer could flush sentences in the correct order once a sequence is computed. Do you see the problem when the output parses don't match the input?
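A minimal sketch of the buffer idea described above, not ParZu's actual implementation. It assumes every sentence is tagged with its input index before it is handed to a worker; the class name `ReorderBuffer` and its `add` method are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Collect (index, result) pairs in any order and release them in input order."""

    def __init__(self):
        self._heap = []       # min-heap of (index, result) pairs not yet released
        self._next_index = 0  # input index of the next sentence that may be released

    def add(self, index, result):
        """Store one finished sentence; return all results that can now be flushed."""
        heapq.heappush(self._heap, (index, result))
        ready = []
        # Flush as long as the smallest buffered index is the one we are waiting for.
        while self._heap and self._heap[0][0] == self._next_index:
            ready.append(heapq.heappop(self._heap)[1])
            self._next_index += 1
        return ready


# Sentences finish out of order but are flushed in input order.
buffer = ReorderBuffer()
flushed = []
for index, parse in [(2, "parse C"), (0, "parse A"), (1, "parse B")]:
    flushed.extend(buffer.add(index, parse))
print(flushed)
```

The buffer only holds sentences that finished ahead of a slower predecessor, so memory use depends on how far the workers drift apart.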

rsennrich commented 8 years ago

Hi Uli,

ParZu should preserve the input order - if it doesn't, this is a bug. What might happen is that the sentences get processed out-of-order, but they should be put together in the right order again by the wrapper script (multiprocessed_parsing.py).

I think what you want to achieve should be easiest with one-sentence-per-line input (which is already supported), and CoNLL output format, where sentences are delimited by empty lines. I regularly post-process the CoNLL format into some one-sentence-per-line representation, e.g. for SMT ( https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/wrappers/conll2mosesxml.py ), and it should be easy to do something like that for your purposes, and then copy the additional annotations from input to output (if they're both one sentence per line).
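A rough sketch of that post-processing step, under some assumptions: the CoNLL columns are ID, FORM, LEMMA, CPOS, POS, MORPH, HEAD, DEPREL (as in the output shown later in this thread), the annotations are kept in a separate list with one entry per input line, and the function names are made up for illustration. It is not part of ParZu.

```python
def read_conll_sentences(lines):
    """Yield one list of token rows per sentence; an empty line ends a sentence."""
    rows = []
    for line in lines:
        if line.strip():
            rows.append(line.split())
        elif rows:
            yield rows
            rows = []
    if rows:
        yield rows


def attach_annotations(annotations, conll_lines):
    """Return one output line per sentence: annotation, tab, FORM@@POS@@LEMMA tokens."""
    sentences = list(read_conll_sentences(conll_lines))
    if len(sentences) != len(annotations):
        # A mismatch means input lines and parsed sentences cannot be aligned.
        raise ValueError("not aligned: %d annotations, %d parses"
                         % (len(annotations), len(sentences)))
    merged = []
    for annotation, rows in zip(annotations, sentences):
        # Assumed columns: row[1] = FORM, row[2] = LEMMA, row[4] = POS.
        tokens = ["%s@@%s@@%s" % (row[1], row[4], row[2]) for row in rows]
        merged.append(annotation + "\t" + " ".join(tokens))
    return merged


# Shortened illustrative input; the annotations "0 0" and "0 1" are made up.
conll = [
    "1 Die die ART ART Def|_|Nom|Pl 2 det _ _",
    "2 Baukosten Baukosten N NN _|Nom|Pl 3 subj _ _",
    "",
    "1 % % N NN _|Nom|_ 3 subj _ _",
    "",
]
merged = attach_annotations(["0 0", "0 1"], conll)
print("\n".join(merged))
```

The length check is the useful part: if the parser ever emits more or fewer sentences than there were input lines, the script fails loudly instead of attaching annotations to the wrong parses.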

If you do find that ParZu mixes up the order, can you give me a reproducible example of this? Do you also observe it if you start ParZu singlethreaded ("-p 1")?

best wishes, Rico

On 03.03.2016 12:44, Uli Fahrer wrote:

Hi,

this is a feature request for an additional input format that also tackles the output format. I often have additional annotations like document id or sentence id for input sentences that I want to preserve in the parses. So an input could look like this:
documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document
For the output, I propose a one-sentence-per-line format, since this is easy to post-process.
0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(*-5,*-5)
I already tried to re-add these annotations after the parsing. However, this solution doesn't work since ParZu computes in a multi-threaded manner and doesn't preserve the input order in its output.

To overcome this issue, one could introduce some sort of buffer that stores finished sentences after parsing. In order to preserve the input order the buffer could flush sentences in the correct order once a sequence is computed. Do you see the problem when the output parses don't match the input?

Best Uli

— Reply to this email directly or view it on GitHub https://github.com/rsennrich/ParZu/issues/5.

Tooa commented 8 years ago

ParZu should preserve the input order - if it doesn't, this is a bug.

After some more digging, I found the issue that led to my wrong assumption that the output order is not preserved. The input file sometimes contains more than one space as a token delimiter. Therefore, the command:

./parzu -i tokenized_lines < inprob -p 12 > prob

with inprob as

Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen .

produces:

1   Die die ART ART Def|_|Nom|Pl    2   det _   _ 
2   Baukosten   Baukosten   N   NN  _|Nom|Pl    3   subj    _   _ 
3   sind    sein    V   VAFIN   3|Pl|Pres|Ind   0   root    _   _ 
4   also    also    ADV ADV _   3   adv _   _ 
5   deutlich    deutlich    ADV ADJD    Pos|    3   pred    _   _ 
6   mehr    mehr    ADV ADV _   7   adv _   _ 
7   als als KOKOM   KOKOM   _   3   kom _   _ 
8   die die ART ART Def|Fem|_|Sg    9   det _   _ 
9   Geschossfläche Geschossfläche N   NN  Fem|_|Sg    7   cj  _   _ 
10  ,   ,   $,  $,  _   0   root    _   _ 
11  nämlich    nämlich    ADV ADV _   12  adv _   _ 
12  um  um  PREP    APPR    _   3   pp  _   _ 
13  insgesamt   insgesamt   ADV ADV _   14  adv _   _ 
14  86  86  CARD    CARD    _   12  pn  _   _ 

1   %   %   N   NN  _|Nom|_ 3   subj    _   _ 
2   ,   ,   $,  $,  _   0   root    _   _ 
3   gestiegen   steigen V   VVPP    _   0   root    _   _ 
4   .   .   $.  $.  _   0   root    _   _

As a result, the input and output files can no longer be aligned: the single input sentence is split into two output sentences after "86". I suggest changing this behavior and making the tokenized_lines input format more robust.

rsennrich commented 8 years ago

Hello Uli,

I'm unable to reproduce your problem:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | ./parzu -i tokenized_lines 2> /dev/null
1   Die die ART ART Def|_|_|Pl  2   det _   _
2   Baukosten   Baukosten   N   NN  _|_|Pl  0   root    _   _
3   sind    sein    V   VAFIN   3|Pl|Pres|Ind   0   root    _   _
4   also    also    ADV ADV _   3   adv _   _
5   deutlich    deutlich    ADV ADJD    Pos|    6   attr    _   _
6   mehr    mehr    PRO PIS _|Nom|Pl    3   subj    _   _
7   als als KOKOM   KOKOM   _   3   kom _   _
8   die die ART ART Def|Fem|_|Sg    9   det _   _
9   Geschossfläche  Geschossfläche  N   NN  Fem|_|Sg    7   cj  _   _
10  ,   ,   $,  $,  _   0   root    _   _
11  nämlich nämlich ADV ADV _   12  adv _   _
12  um  um  PREP    APPR    _   3   pp  _   _
13  insgesamt   insgesamt   ADV ADV _   15  adv _   _
14  86  86  CARD    CARD    _   15  attr    _   _
15  %   %   N   NN  _|_|Pl  12  pn  _   _
16  ,   ,   $,  $,  _   0   root    _   _
17  gestiegen   steigen V   VVPP    _   3   aux _   _
18  .   .   $.  $.  _   0   root    _   _

I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message? Alternatively, are you using a particularly old or new version of Python?
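One way to check for such a character is sketched below. It assumes Python 3 and a line that has already been decoded to a Unicode string; the helper name `find_unusual_whitespace` is made up for this example and is not part of ParZu.

```python
import unicodedata


def find_unusual_whitespace(line):
    """Return (position, code point, name) for whitespace other than space, tab, CR, LF."""
    hits = []
    for position, char in enumerate(line):
        if char.isspace() and char not in " \t\r\n":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((position, "U+%04X" % ord(char), name))
    return hits


# A no-break space between "86" and "%" looks like a normal space in most editors.
print(find_unusual_whitespace("insgesamt 86\u00a0% ,"))
print(find_unusual_whitespace("insgesamt 86 % ,"))
```

Running it over the original input file line by line (opened with `encoding="utf-8"`) would show whether the sentence contains anything other than plain spaces between tokens.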

You can also test the tokenizer in isolation:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | python preprocessor/tokenized_lines.py
Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen .

Tooa commented 8 years ago

I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message?

Interesting, you are right. Your example works for me. However, the attached sentence [1] produces the error I mentioned before.

Can you reproduce the problem with the provided file? I use Python 2.7.10. The character between 86 and % looks like this one [2].

[1] https://www.dropbox.com/s/mp670vkbtdhuoof/inprob?dl=0
[2] http://www.unicodemap.org/details/0x00A0/index.html
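As a possible workaround on the caller's side, the input can be normalized before it reaches the parser. The sketch below assumes Python 3, where `str.split()` without arguments treats U+00A0 and runs of whitespace as delimiters, and it assumes that a no-break space is meant as a token boundary. The function name is hypothetical; this is not the fix that went into ParZu.

```python
def normalize_token_delimiters(line):
    """Collapse every run of Unicode whitespace into one plain space."""
    # str.split() with no argument splits on any whitespace, including U+00A0,
    # and drops empty strings, so repeated delimiters disappear as well.
    return " ".join(line.split())


# Double space and a no-break space both become a single plain space.
cleaned = normalize_token_delimiters("um  insgesamt 86\u00a0% , gestiegen .\n")
print(cleaned)
```

If a no-break space is supposed to keep two strings together as one token, this normalization would change the tokenization, so it should only be applied when that is not the case.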

rsennrich commented 8 years ago

Hi Uli,

hm, I'm tempted to just blame the bad unicode support on Python 2 (tokenized_lines.py works fine in Python 3.4.3), but I just committed a fix that should improve unicode handling in Python 2.7. I hope this solves the problem for you.

Tooa commented 8 years ago

Thank you very much. Looks good to me. I was not aware that ParZu works with Python 3.
