Closed Tooa closed 8 years ago
Hi Uli,
ParZu should preserve the input order - if it doesn't, this is a bug. What might happen is that the sentences get processed out-of-order, but they should be put together in the right order again by the wrapper script (multiprocessed_parsing.py).
I think what you want to achieve should be easiest with one-sentence-per-line input (which is already supported) and CoNLL output format, where sentences are delimited by empty lines. I regularly post-process the CoNLL format into some one-sentence-per-line representation, e.g. for SMT (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/wrappers/conll2mosesxml.py), and it should be easy to do something like that for your purposes, and then copy the additional annotations from input to output (if they're both one sentence per line).
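A minimal sketch of that post-processing step in Python 3, assuming tab-separated CoNLL columns in the order ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL and an annotation file with exactly one line per input sentence. The function names and the `form@@tag@@lemma@@morph` token layout are illustrative choices, not part of ParZu:

```python
def read_conll_sentences(lines):
    """Yield each sentence as a list of token rows (lists of columns).

    Sentences are assumed to be separated by empty lines.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line.split("\t"))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # last sentence without a trailing empty line
        yield sentence


def attach_annotations(annotation_lines, conll_lines):
    """Return one output line per sentence: annotation, tab, flattened parse."""
    annotations = [line.rstrip("\n") for line in annotation_lines]
    sentences = list(read_conll_sentences(conll_lines))
    if len(annotations) != len(sentences):
        # Refuse to guess: a mismatch means input and output are not aligned.
        raise ValueError(
            "%d annotation lines but %d parsed sentences"
            % (len(annotations), len(sentences))
        )
    output = []
    for annotation, sentence in zip(annotations, sentences):
        # Columns used: FORM (1), POS (4), LEMMA (2), FEATS (5).
        tokens = " ".join(
            "@@".join((row[1], row[4], row[2], row[5])) for row in sentence
        )
        output.append(annotation + "\t" + tokens)
    return output
```

The length check is deliberate: if the parser output contains a different number of sentences than the input, the script fails instead of silently pairing annotations with the wrong parses.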
If you do find that ParZu mixes up the order, can you give me a reproducible example of this? Do you also observe it if you start ParZu single-threaded ("-p 1")?
best wishes, Rico
On 03.03.2016 12:44, Uli Fahrer wrote:
Hi,
this is a feature request for an additional input format that also tackles the output format. I often have additional annotations like document id or sentence id for input sentences that I want to preserve in the parses. So an input could look like this:
documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document
For the output, I propose a one-sentence-per-line format since this is easy to post-process.
0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(*-5,*-5)
I already tried to re-add these annotations after the parsing. However, this solution doesn't work since ParZu computes in a multi-threaded manner and doesn't preserve the input order in its output.
To overcome this issue, one could introduce some sort of buffer that stores finished sentences after parsing. In order to preserve the input order the buffer could flush sentences in the correct order once a sequence is computed. Do you see the problem when the output parses don't match the input?
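The buffer idea can be sketched in Python 3 as follows. This is only an illustration of the technique under the assumption that each sentence carries its 0-based input index; the class name `ReorderBuffer` is hypothetical and this is not how ParZu or multiprocessed_parsing.py is implemented:

```python
import heapq


class ReorderBuffer:
    """Collect results that finish out of order and release them in input order."""

    def __init__(self):
        self._heap = []       # min-heap of (input_index, result)
        self._next_index = 0  # index of the next result allowed to leave

    def add(self, index, result):
        """Store one finished result and return all results that are now ready."""
        heapq.heappush(self._heap, (index, result))
        ready = []
        # Flush the longest run of consecutive indices starting at _next_index.
        while self._heap and self._heap[0][0] == self._next_index:
            ready.append(heapq.heappop(self._heap)[1])
            self._next_index += 1
        return ready


buffer = ReorderBuffer()
for index, parse in [(2, "third"), (0, "first"), (1, "second")]:
    for finished in buffer.add(index, parse):
        print(finished)
```

Indices are assumed to be unique, so the heap never has to compare two results directly.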
Best Uli
ParZu should preserve the input order - if it doesn't, this is a bug.
After some more digging, I found the issue that led to my wrong assumption that the output order is not preserved. The input file sometimes contains more than one space as a token delimiter. Therefore, the command:
./parzu -i tokenized_lines < inprob -p 12 > prob
with inprob as
Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen .
produces:
1 Die die ART ART Def|_|Nom|Pl 2 det _ _
2 Baukosten Baukosten N NN _|Nom|Pl 3 subj _ _
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root _ _
4 also also ADV ADV _ 3 adv _ _
5 deutlich deutlich ADV ADJD Pos| 3 pred _ _
6 mehr mehr ADV ADV _ 7 adv _ _
7 als als KOKOM KOKOM _ 3 kom _ _
8 die die ART ART Def|Fem|_|Sg 9 det _ _
9 Geschossfläche Geschossfläche N NN Fem|_|Sg 7 cj _ _
10 , , $, $, _ 0 root _ _
11 nämlich nämlich ADV ADV _ 12 adv _ _
12 um um PREP APPR _ 3 pp _ _
13 insgesamt insgesamt ADV ADV _ 14 adv _ _
14 86 86 CARD CARD _ 12 pn _ _
1 % % N NN _|Nom|_ 3 subj _ _
2 , , $, $, _ 0 root _ _
3 gestiegen steigen V VVPP _ 0 root _ _
4 . . $. $. _ 0 root _ _
This makes the input and output files impossible to align, because the single input sentence comes back as two sentences. I suggest changing this behavior and making the tokenized_lines input format more robust.
Hello Uli,
I'm unable to reproduce your problem:
echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | ./parzu -i tokenized_lines 2> /dev/null
1 Die die ART ART Def|||Pl 2 det
2 Baukosten Baukosten N NN ||Pl 0 root
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root
4 also also ADV ADV 3 adv
5 deutlich deutlich ADV ADJD Pos| 6 attr
6 mehr mehr PRO PIS |Nom|Pl 3 subj
7 als als KOKOM KOKOM 3 kom
8 die die ART ART Def|Fem||Sg 9 det
9 Geschossfläche Geschossfläche N NN Fem||Sg 7 cj
10 , , $, $, 0 root
11 nämlich nämlich ADV ADV 12 adv
12 um um PREP APPR 3 pp
13 insgesamt insgesamt ADV ADV 15 adv
14 86 86 CARD CARD 15 attr
15 % % N NN ||Pl 12 pn
16 , , $, $, 0 root
17 gestiegen steigen V VVPP 3 aux
18 . . $. $. 0 root
I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message? Alternatively, are you using a particularly old or new version of Python?
You can also test the tokenizer in isolation:
echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | python preprocessor/tokenized_lines.py
Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen .
I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message?
Interesting, you are right. Your example works for me. However, the attached sentence [1] produces the error I mentioned before.
Can you reproduce the problem with the provided file? I use Python 2.7.10. The character between 86 and % looks like this one [2].
[1] https://www.dropbox.com/s/mp670vkbtdhuoof/inprob?dl=0
[2] http://www.unicodemap.org/details/0x00A0/index.html
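For anyone hitting something similar, here is a small Python 3 sketch for spotting such characters in a tokenized input file before parsing. Both function names are hypothetical helpers, not part of ParZu, and the normalization step assumes that every whitespace character, including U+00A0, is meant as a token delimiter:

```python
import unicodedata


def find_unusual_whitespace(line):
    """Return (position, code point, Unicode name) for every whitespace
    character that is not a plain space, tab, or newline."""
    hits = []
    for position, char in enumerate(line):
        if char.isspace() and char not in " \t\n":
            hits.append(
                (position, "U+%04X" % ord(char), unicodedata.name(char, "UNKNOWN"))
            )
    return hits


def normalize_token_delimiters(line):
    """Collapse any run of Unicode whitespace into a single ASCII space."""
    return " ".join(line.split())


sample = "insgesamt 86\u00a0% ,  gestiegen ."
print(find_unusual_whitespace(sample))
print(normalize_token_delimiters(sample))
```

In Python 3, `str.split()` without arguments treats U+00A0 as whitespace, so the normalized line uses single ASCII spaces throughout. If a no-break space is supposed to stay inside a token, this normalization is not appropriate.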
Hi Uli,
Hm, I'm tempted to just blame the bad Unicode support in Python 2 (tokenized_lines.py works fine in Python 3.4.3), but I just committed a fix that should improve Unicode handling in Python 2.7. I hope this solves the problem for you.
Thank you very much. Looks good to me. I was not aware that ParZu works with Python 3.