mesolitica / malaya

Natural Language Toolkit for Malaysian language, https://malaya.readthedocs.io/
MIT License

Non-deterministic dependency parsing and multiple roots / no root #87

Closed by waterhall 3 years ago

waterhall commented 3 years ago

Hi, I'm trying the dependency parser.

The output is non-deterministic, and sometimes there are multiple roots or no root. Am I using the library correctly? Is there a bug?

Thank you.

import malaya

model = malaya.dependency.transformer(model = 'xlnet')
string = 'Saya suka makan . '
# run the same input three times to compare the outputs
d_object, tagging, indexing = model.predict(string)
print(tagging)
d_object, tagging, indexing = model.predict(string)
print(tagging)
d_object, tagging, indexing = model.predict(string)
print(tagging)

[('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'root'), ('.', 'punct')]
[('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'advcl'), ('.', 'punct')]
[('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'xcomp'), ('.', 'punct')]

waterhall commented 3 years ago

Also sometimes it returns -1 as a head and it returns 'PAD' or 'X'.

Could you explain?

The tag set does not match the dependencies listed by malaya.dependency.describe()

huseinzol05 commented 3 years ago

Did you try setting a random seed?

import tensorflow as tf
import malaya
tf.compat.v1.set_random_seed(1234)
huseinzol05 commented 3 years ago

And you probably need to stack multiple models, https://malaya.readthedocs.io/en/latest/load-dependency.html#Voting-stack-model
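
To illustrate the idea behind voting, here is a minimal sketch that combines several label sequences for the same sentence by per-token majority vote. It is not Malaya's stacking implementation; `majority_vote_tags` is a hypothetical helper written for this thread, and the input is the three outputs posted above.

```python
from collections import Counter


def majority_vote_tags(runs):
    """Combine several [(word, label), ...] sequences by per-token majority vote.

    Hypothetical helper for illustration only, not part of Malaya.
    """
    if not runs:
        return []
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must cover the same number of tokens")
    voted = []
    for position in zip(*runs):
        word = position[0][0]
        counts = Counter(label for _, label in position)
        # Take the most frequent label at this position.
        voted.append((word, counts.most_common(1)[0][0]))
    return voted


# The three outputs from the original report.
runs = [
    [('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'root'), ('.', 'punct')],
    [('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'advcl'), ('.', 'punct')],
    [('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'xcomp'), ('.', 'punct')],
]
print(majority_vote_tags(runs))
```

Note that 'makan' gets three different labels here, so the vote is a three-way tie and the winner is not meaningful. Voting over labels also does not by itself guarantee a single root.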

huseinzol05 commented 3 years ago

Some TensorFlow operations use approximations to speed up the backend, so their results can vary between runs. One example is reduce_sum: https://github.com/tensorflow/tensorflow/issues/3103.

waterhall commented 3 years ago

Ok thanks for that. Could you explain the -1 as a head and the dep labels 'PAD' / 'X'? What do the labels mean?

huseinzol05 commented 3 years ago

During training, a batch contains N strings of different lengths, so the shorter ones are padded with 'PAD':

[['s','a','y','a'], ['m','a','k','a','n']]
# add padding so that all sequences have the same length
[['s','a','y','a','PAD'], ['m','a','k','a','n']]

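The 'X' label comes from BPE: only the first subword of a word keeps the word's label, and the remaining subwords are labelled 'X',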

label = 'nsubj'
word = 'ayam'
bpe_word = ['ay_', 'am'] # not an actual bpe
bpe_label = ['nsubj', 'X']

I never tested training the models with the label repeated on every subword, like this,

label = 'nsubj'
word = 'ayam'
bpe_word = ['ay_', 'am'] # not an actual bpe
bpe_label = ['nsubj', 'nsubj']
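
The first labelling scheme above, together with padding, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to train the Malaya models: `toy_split` is a made-up splitter that cuts each word in half (it is not a real BPE tokenizer), and all function names are hypothetical.

```python
PAD = 'PAD'
CONTINUATION = 'X'


def toy_split(word):
    # Hypothetical splitter: cuts a word in half. Not an actual BPE.
    if len(word) < 2:
        return [word]
    middle = len(word) // 2
    return [word[:middle] + '_', word[middle:]]


def expand_labels(words, labels, split=toy_split):
    """Give the first subword of each word its label and the rest 'X'."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        subwords = split(word)
        pieces.extend(subwords)
        piece_labels.extend([label] + [CONTINUATION] * (len(subwords) - 1))
    return pieces, piece_labels


def pad_to(sequence, length, filler=PAD):
    """Right-pad a sequence with 'PAD' up to the given length."""
    return list(sequence) + [filler] * (length - len(sequence))


def collapse_labels(piece_labels):
    """Recover one label per word by dropping 'X' and 'PAD' positions."""
    return [label for label in piece_labels if label not in (CONTINUATION, PAD)]


pieces, piece_labels = expand_labels(['ayam'], ['nsubj'])
print(pieces, piece_labels)
padded = pad_to(piece_labels, 4)
print(padded)
print(collapse_labels(padded))
```

`collapse_labels` only round-trips cleanly for labels built this way. A trained model can in principle predict 'X' or 'PAD' at any position, including the first subword of a word, and this sketch does not handle that case.
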
waterhall commented 3 years ago

Okay, so is it an error if I get 'X' or 'PAD' as a tag? I'm giving the parser whole words, not subword units.

Also, I get multiple roots: ('menggantikan', 'root') ('menggabungkan', 'root')

Is that a bug as well?

[('Keturunan', 'nsubj'), ('Rollo', 'flat'), ("'", 'punct'), ('s', 'conj'), ('Vikings', 'flat'), ('dan', 'cc'), ('isteri', 'conj'), ('Frankish', 'flat'), ('mereka', 'det'), ('akan', 'advmod'), ('menggantikan', 'root'), ('agama', 'obj'), ('Norse', 'flat'), ('dan', 'cc'), ('bahasa', 'conj'), ('Norse', 'flat'), ('Lama', 'flat'), ('dengan', 'case'), ('Katolik', 'nmod'), ('(', 'punct'), ('Kristian', 'appos'), (')', 'punct'), ('dan', 'cc'), ('bahasa', 'conj'), ('Gallo', 'flat'), ('-', 'punct'), ('Romance', 'flat'), ('dari', 'case'), ('penduduk', 'nmod'), ('tempatan', 'amod'), (',', 'punct'), ('menggabungkan', 'root'), ('warisan', 'obj'), ('Frankish', 'flat'), ('ibu', 'compound'), ('mereka', 'det'), ('dengan', 'case'), ('tradisi', 'obl'), ('dan', 'cc'), ('adat', 'conj'), ('resam', 'compound'), ('Old', 'flat'), ('Norse', 'flat'), ('untuk', 'case'), ('mensintesis', 'xcomp'), ('budaya', 'obj'), ('"', 'punct'), ('Norman', 'appos'), ('"', 'punct'), ('yang', 'nsubj'), ('unik', 'amod'), ('di', 'case'), ('utara', 'nmod'), ('Perancis', 'flat'), ('.', 'punct')]
[('Keturunan', 11), ('Rollo', 1), ("'", 7), ('s', 2), ('Vikings', 4), ('dan', 7), ('isteri', 1), ('Frankish', 7), ('mereka', 7), ('akan', 11), ('menggantikan', 0), ('agama', 11), ('Norse', 12), ('dan', 15), ('bahasa', 12), ('Norse', 15), ('Lama', 16), ('dengan', 19), ('Katolik', 12), ('(', 21), ('Kristian', 19), (')', 21), ('dan', 24), ('bahasa', 19), ('Gallo', 25), ('-', 25), ('Romance', 25), ('dari', 30), ('penduduk', 24), ('tempatan', 29), (',', 24), ('menggabungkan', 0), ('warisan', 32), ('Frankish', 34), ('ibu', 34), ('mereka', 35), ('dengan', 39), ('tradisi', 33), ('dan', 40), ('adat', 39), ('resam', 40), ('Old', 40), ('Norse', 42), ('untuk', 46), ('mensintesis', 33), ('budaya', 45), ('"', 49), ('Norman', 46), ('"', 49), ('yang', 52), ('unik', 47), ('di', 53), ('utara', 50), ('Perancis', 53), ('.', 32)]

And sometimes the head is -1. Is that an error too? ('feudal', -1)

[('Kaum', 5), ('Norman', 1), ('selepas', 5), ('itu', 3), ('menerima', 0), ('pakai', 5), ('doktrin', 6), ('feudal', -1), ('yang', 11), ('semakin', 10), ('meningkat', 7), ('di', 14), ('seluruh', 15), ('Perancis', 11), (',', 17), ('dan', 23), ('bekerja', 5), ('mereka', 17), ('ke', 21), ('dalam', 20), ('sistem', 17), ('hierarki', 21), ('berfungsi', 17), ('di', 28), ('kedua', 27), ('-', 24), ('dua', 24), ('Normandy', 23), ('dan', 28), ('di', 31), ('England', 28), ('.', 5)]
[('Kaum', 'nsubj'), ('Norman', 'flat'), ('selepas', 'mark'), ('itu', 'det'), ('menerima', 'root'), ('pakai', 'fixed'), ('doktrin', 'obj'), ('feudal', 'amod'), ('yang', 'nsubj'), ('semakin', 'advmod'), ('meningkat', 'acl'), ('di', 'case'), ('seluruh', 'det'), ('Perancis', 'nmod'), (',', 'punct'), ('dan', 'cc'), ('bekerja', 'conj'), ('mereka', 'obj'), ('ke', 'case'), ('dalam', 'case'), ('sistem', 'obl'), ('hierarki', 'compound'), ('berfungsi', 'conj'), ('di', 'case'), ('kedua', 'nummod'), ('-', 'punct'), ('dua', 'nummod'), ('Normandy', 'obl'), ('dan', 'cc'), ('di', 'case'), ('England', 'nmod'), ('.', 'punct')]
huseinzol05 commented 3 years ago

For multiple roots, the model has no way to force the output to contain only a single root. The -1 head should be fixed in https://github.com/huseinzol05/malaya/tree/4.3.1
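
Since the model does not enforce these constraints, one option is to check each parse on the caller's side. Below is a minimal sketch, assuming the conventions visible in the outputs posted above: heads are 1-based token positions and 0 marks the root. `check_parse` is a hypothetical helper, not part of Malaya, and the example input is made up for illustration.

```python
def check_parse(indexing, tagging):
    """Return a list of problems found in one parse.

    indexing: [(word, head), ...] with 1-based heads, 0 meaning root (assumed).
    tagging:  [(word, label), ...] aligned with indexing.
    """
    problems = []
    n = len(indexing)
    roots = [i for i, (_, head) in enumerate(indexing, start=1) if head == 0]
    if len(roots) != 1:
        problems.append(f"expected 1 root, found {len(roots)} at positions {roots}")
    for i, (word, head) in enumerate(indexing, start=1):
        if head < 0 or head > n:
            problems.append(f"token {i} ({word!r}) has out-of-range head {head}")
    for i, (word, label) in enumerate(tagging, start=1):
        if label in ('PAD', 'X'):
            problems.append(f"token {i} ({word!r}) has placeholder label {label!r}")
    return problems


# Made-up example combining the symptoms reported in this thread.
indexing = [('Saya', 2), ('suka', 0), ('makan', -1), ('.', 2)]
tagging = [('Saya', 'nsubj'), ('suka', 'root'), ('makan', 'X'), ('.', 'punct')]
for problem in check_parse(indexing, tagging):
    print(problem)
```

A parse that fails the check could be retried, discarded, or repaired, depending on what the downstream task tolerates.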