udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Lost SpaceAfter=No of a multi-word token #82

dan-zeman closed this issue 3 years ago

dan-zeman commented 3 years ago

Input (from UD_Turkish-PUD/tr_pud-ud-test.conllu):

# newdoc id = n01003
# sent_id = n01003007
# text = Maksimum miktar kişi başı 5.000 dolardır.
# text_en = $5,000 per person, the maximum allowed.
1       Maksimum        Maksimum        ADJ     JJ      Number=Sing     2       amod    _       _
2       miktar  miktar  NOUN    NN      Case=Nom|Number=Sing    6       nsubj   _       _
3       kişi    kişi    NOUN    NN      Number=Sing     4       nmod:poss       _       _
4       başı    baş     NOUN    NN      Number=Sing|Number[psor]=Sing|Person[psor]=3    6       amod    _       _
5       5.000   5.000   NUM     CD      Number=Sing     6       nummod  _       _
6-7     dolardır        _       _       _       _       _       _       _       SpaceAfter=No
6       dolar   do      NOUN    NN      Number=Sing     0       root    _       _
7       dır     i       AUX     AUX     Aspect=Perf|Mood=Gen|Number=Sing|Person=3|Tense=Pres    6       cop     _       _
8       .       .       PUNCT   .       _       6       punct   _       _

Command:

cat input.conllu | udapy -s > output.conllu

Output (the 6-7 line has lost SpaceAfter=No in its last column):

# newdoc id = n01003
# sent_id = n01003007
# text = Maksimum miktar kişi başı 5.000 dolardır.
# text_en = $5,000 per person, the maximum allowed.
1       Maksimum        Maksimum        ADJ     JJ      Number=Sing     2       amod    _       _
2       miktar  miktar  NOUN    NN      Case=Nom|Number=Sing    6       nsubj   _       _
3       kişi    kişi    NOUN    NN      Number=Sing     4       nmod:poss       _       _
4       başı    baş     NOUN    NN      Number=Sing|Number[psor]=Sing|Person[psor]=3    6       amod    _       _
5       5.000   5.000   NUM     CD      Number=Sing     6       nummod  _       _
6-7     dolardır        _       _       _       _       _       _       _       _
6       dolar   do      NOUN    NN      Number=Sing     0       root    _       _
7       dır     i       AUX     AUX     Aspect=Perf|Mood=Gen|Number=Sing|Person=3|Tense=Pres    6       cop     _       _
8       .       .       PUNCT   .       _       6       punct   _       _
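
A quick way to spot this kind of loss across a whole file is to compare the MISC column of every multi-word token line before and after the round trip. The following is only a sketch: it assumes both files are tab-separated CoNLL-U with ten columns per token line, and the helper names mwt_misc and lost_misc are hypothetical, not part of udapi.

```python
def mwt_misc(conllu_text):
    """Map (sent_id, ID range) -> MISC value for every multi-word token line."""
    result = {}
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # Multi-word token lines carry an ID range such as "6-7".
            if len(cols) == 10 and "-" in cols[0]:
                result[(sent_id, cols[0])] = cols[9]
    return result


def lost_misc(before, after):
    """Return multi-word tokens whose MISC value differs between two CoNLL-U strings.

    The result maps (sent_id, ID range) to a pair (value before, value after);
    the second element is None if the token line is missing from `after`.
    """
    old, new = mwt_misc(before), mwt_misc(after)
    return {
        key: (value, new.get(key))
        for key, value in old.items()
        if new.get(key) != value
    }
```

Reading input.conllu and output.conllu as UTF-8 strings and passing them to lost_misc should then list the 6-7 token of sentence n01003007 whenever its SpaceAfter=No was dropped.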

martinpopel commented 3 years ago

Many thanks for reporting this bug (which I introduced recently while making write.Conllu faster).

dan-zeman commented 3 years ago

Thanks for the fix!