nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Parse error of Italian #40

Closed by gifdog97 10 months ago

gifdog97 commented 2 years ago

I used the Italian model to predict the dependency tree and obtained the following result:

1   Il  il  DET RD  Definite=Def|Gender=Masc|Number=Sing|PronType=Art   2   det _
2   termine termine NOUN    S   Gender=Masc|Number=Sing 8   nsubj:pass  _   _
3   "   "   PUNCT   FB  _   4   punct   _   _
4   Tathāgata   Tathāgata   PROPN   SP  _   2   nmod    _   _
5   "   "   PUNCT   FB  _   4   punct   _   _
6   può potere  AUX VM  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   8   aux _
7   essere  essere  AUX VA  VerbForm=Inf    8   aux:pass    _   _
8   letto   leggere VERB    V   Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part    0   root    _
9   come    come    ADP E   _   11  case    _   _
10  "   "   PUNCT   FB  _   11  punct   _   _
11  tathā-gata  tathā-gata  NOUN    S   Gender=Fem|Number=Sing  8   obl _   _
12  "   "   PUNCT   FB  _   11  punct   _   _
13  o   o   CCONJ   CC  _   16  cc  _   _
14  come    come    ADP E   _   16  case    _   _
15  "   "   PUNCT   FB  _   16  punct   _   _
16  Tathā-āgata Tathā-āgata PROPN   SP  _   11  conj    _   _
17  "   "   PUNCT   FB  _   16  punct   _   _
18  ,   ,   PUNCT   FF  _   16  punct   _   _
19  dove    dove    ADV B   _   22  advmod  _   _
20  il  il  DET RD  Definite=Def|Gender=Masc|Number=Sing|PronType=Art   21  det _
21  primo   primo   ADJ NO  Gender=Masc|Number=Sing|NumType=Ord 22  nsubj   _   _
22  significa   significare VERB    V   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   16  acl:relcl   _   _
23  "   "   PUNCT   FB  _   25  punct   _   _
24  così    così    ADV B   _   25  advmod  _   _
25  andato  andare  VERB    V   Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part    22  xcomp   _
26  "   "   PUNCT   FB  _   25  punct   _   _
27  mentre  mentre  CCONJ   CC  _   30  cc  _   _
28  il  il  DET RD  Definite=Def|Gender=Masc|Number=Sing|PronType=Art   29  det _
29  secondo secondo ADJ NO  Gender=Masc|Number=Sing|NumType=Ord 30  nsubj   _   _
30  significa   significare VERB    V   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   22  conj    _   _
31  "   "   PUNCT   FB  _   32  punct   _   _
32  così venuto così venuto ADV B   _   30  advmod  _   _
33  "   "   PUNCT   FB  _   32  punct   _   _
34  .   .   PUNCT   FS  _   8   punct   _   _

I think line 32 is invalid because a single token contains a space.

Curiously, in another sentence containing 'così venuto', these two words are treated as separate tokens:

1   Così    così    ADV B   _   2   advmod  _   _
2   venuto  venire  VERB    V   Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part    0   root    _   _
3   /   /   PUNCT   FF  _   2   punct   _   _
4   Così    così    ADV B   _   5   advmod  _   _
5   andato  andare  VERB    V   Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part    2   conj    _   _
6   .   .   PUNCT   FS  _   2   punct   _   _

Is this a bug? I'd appreciate it if you could investigate this issue.
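For anyone who wants to detect this kind of output automatically, here is a minimal sketch that scans CoNLL-U text for tokens whose FORM column contains a space, as on line 32 above. It assumes the columns are tab-separated, as in standard CoNLL-U; the function name `forms_with_spaces` and the `sample` data are illustrative and not part of trankit.

```python
def forms_with_spaces(conllu_text):
    """Return (ID, FORM) pairs for tokens whose FORM contains a space.

    Assumes tab-separated CoNLL-U columns; blank lines and '#' comment
    lines are skipped.
    """
    offending = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a token line
        token_id, form = cols[0], cols[1]
        if " " in form:
            offending.append((token_id, form))
    return offending


# Three token lines modeled on the output above (tabs restored by hand).
sample = "\n".join([
    "31\t\"\t\"\tPUNCT\tFB\t_\t32\tpunct\t_\t_",
    "32\tcosì venuto\tcosì venuto\tADV\tB\t_\t30\tadvmod\t_\t_",
    "33\t\"\t\"\tPUNCT\tFB\t_\t32\tpunct\t_\t_",
])

print(forms_with_spaces(sample))  # [('32', 'così venuto')]
```

Running a check like this over a whole parsed corpus would show how often the tokenizer merges two words into one token.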

minhhdvn commented 10 months ago

Hi @gifdog97, thanks for reporting the issue. Our tokenizer is a neural model trained on a text corpus, so the tokenization of the same piece of text may vary depending on its context.
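If context-dependent tokenization is a problem, one possible workaround is to supply the tokens yourself and skip the neural tokenizer. The sketch below assumes that trankit's `posdep` accepts a pretokenized sentence (a list of strings) together with `is_sent=True`; please check this against the trankit documentation for your version. The `pretokenize` helper and its regular expression are hypothetical, a deliberately simple rule-based splitter and not part of trankit.

```python
import re
import unicodedata

# Hypothetical rule: a token is a run of word characters (optionally joined
# by hyphens, e.g. "Tathā-āgata") or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def pretokenize(text):
    """Split text into tokens with a fixed rule, independent of context."""
    # NFC keeps accented letters such as "ì" as single word characters.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)


def parse_pretokenized(text):
    """Tag and parse one sentence using our own tokens.

    Assumption: Pipeline.posdep takes a list of tokens when is_sent=True.
    trankit is imported lazily so that the helper above runs without it.
    """
    from trankit import Pipeline

    pipeline = Pipeline("italian")
    return pipeline.posdep(pretokenize(text), is_sent=True)


print(pretokenize('il secondo significa "così venuto".'))
# ['il', 'secondo', 'significa', '"', 'così', 'venuto', '"', '.']
```

With fixed tokens, 'così' and 'venuto' always stay separate, at the cost of losing whatever the neural tokenizer does better than a regex (multi-word token expansion, for example).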