nltk / nltk_data

NLTK Data

Russian POS tag mapping is wrong #116

Open ghost opened 6 years ago

ghost commented 6 years ago

The current mapping is:

!            .
A            ADJ
AD           ADV
C            CONJ
COMP         CONJ
IJ           X
NC           NUM
NN           NOUN
P            ADP
PTCL         PRT
V            VERB
VG           VERB
VI           VERB
VP           VERB
YES_NO_SENT  X
Z            X

whereas the tags listed at http://www.ruscorpora.ru/en/corpora-morph.html are different.

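For example, the tagger emits tags such as S, CONJ, ADV and NONLEX, none of which appears as a source tag in the mapping: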

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

(This example is taken from https://www.nltk.org/api/nltk.tag.html.)
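
The mismatch can be checked mechanically. The sketch below is not NLTK code: it hard-codes the mapping and the tagger output quoted in this issue, and the helper name `unmapped_tags` is made up for illustration. It lists the tags from the example that have no source entry in the mapping.

```python
# Source tag -> universal tag, copied from the mapping quoted in this issue.
MAPPING = {
    "!": ".", "A": "ADJ", "AD": "ADV", "C": "CONJ", "COMP": "CONJ",
    "IJ": "X", "NC": "NUM", "NN": "NOUN", "P": "ADP", "PTCL": "PRT",
    "V": "VERB", "VG": "VERB", "VI": "VERB", "VP": "VERB",
    "YES_NO_SENT": "X", "Z": "X",
}

# Tagger output copied from the example in this issue.
TAGGED = [
    ("Илья", "S"), ("оторопел", "V"), ("и", "CONJ"), ("дважды", "ADV"),
    ("перечитал", "V"), ("бумажку", "S"), (".", "NONLEX"),
]


def unmapped_tags(tagged_tokens, mapping):
    """Return, in first-seen order, the tags that are not keys of `mapping`."""
    missing = []
    for _, tag in tagged_tokens:
        if tag not in mapping and tag not in missing:
            missing.append(tag)
    return missing


print(unmapped_tags(TAGGED, MAPPING))
# ['S', 'CONJ', 'ADV', 'NONLEX']
```

Only V is covered by the mapping; the other four tags in this one sentence would have nothing to map to.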