nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

Need help understanding the labels of the parser model #104

Open sujoung opened 1 year ago

sujoung commented 1 year ago

Hello! First, I have to say that I love this project. It has really helped me explore the syntax of different kinds of text, so thank you so much!

I have a question about tagsets. I am using the Swedish model, and I remember that a few years ago it was based on a Swedish treebank tagset called Mamba. It seems this has changed in the new version (benepar-sv2).

I printed the labels that were used to train the core model and got these results:

>>> parser._parser.label_vocab
{'': 0,
 'AP': 1,
 'AP::AP': 2,
 'AP::XP': 3,
 'AVP': 4,
 'AVP::XP': 5,
 'NP': 6,
 'NP::AP': 7,
 'NP::NP': 8,
 'NP::NP::AP': 9,
 'NP::NP::NP::NP::XP': 10,
 'NP::NP::S': 11,
 'NP::NP::VP': 12,
 'NP::PP': 13,
 'NP::S': 14,
 'NP::XP': 15,
 'NP::XP::NP': 16,
 'NP::XP::S': 17,
 'PP': 18,
 'PP::AVP': 19,
 'PP::AVP::XP': 20,
 'PP::NP': 21,
 'PP::XP': 22,
 'PSEUDO': 23,
 'S': 24,
 'S::AVP': 25,
 'S::NP': 26,
 'S::NP::NP': 27,
 'S::NP::NP::NP::NP': 28,
 'S::NP::S': 29,
 'S::NP::XP': 30,
 'S::NP::XP::S': 31,
 'S::PP': 32,
 'S::PP::NP': 33,
 'S::S': 34,
 'S::S::NP': 35,
 'S::S::NP::NP': 36,
 'S::VP': 37,
 'S::XP': 38,
 'VP': 39,
 'VP::AP': 40,
 'VP::PP': 41,
 'VP::S': 42,
 'VP::VP': 43,
 'VP::XP': 44,
 'XP': 45,
 'XP::AVP': 46,
 'XP::NP': 47,
 'XP::PP': 48,
 'XP::S': 49}
>>> parser._parser.tag_vocab
{'AB': 1,
 'DT': 2,
 'HA': 3,
 'HD': 4,
 'HP': 5,
 'HS': 6,
 'IE': 7,
 'IN': 8,
 'JJ': 9,
 'KN': 10,
 'MAD': 11,
 'MID': 12,
 'NN': 13,
 'P': 14,
 'PAD': 15,
 'PC': 16,
 'PL': 17,
 'PM': 18,
 'PN': 19,
 'PS': 20,
 'RG': 21,
 'RO': 22,
 'SN': 23,
 'UNK': 0,
 'UO': 24,
 'VB': 25}

What is the difference between NP::NP::S and S::NP::NP?
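One way to inspect these compound labels is to split them on `::`. The sketch below assumes, without confirmation from the project, that `::` joins a chain of single-child (unary) nodes and that the first part is the outermost node. The helper names `split_label` and `as_unary_chain` are made up for illustration and are not part of benepar.

```python
def split_label(label):
    """Split a compound label such as 'S::NP::NP' into its parts."""
    return label.split("::") if label else []


def as_unary_chain(label, leaf="..."):
    """Render a label as nested brackets.

    Assumption: the first part is the outermost node of the chain.
    """
    text = leaf
    for part in reversed(split_label(label)):
        text = f"({part} {text})"
    return text


for label in ["NP::NP::S", "S::NP::NP"]:
    print(label, "->", as_unary_chain(label))
```

Under that assumption the two labels describe different chains: one has S as the innermost node, the other has S as the outermost node.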

[Image: Screenshot 2023-09-20 at 10 45 57]

In this example (in English: "Hello, I am a banana"), there is an S (simple declarative clause) that has two NPs as children. Would this be NP::NP::S or S::NP::NP? And what is happening with AUX? It is hard for me to think of any structure where an S has only two NPs, because at least one VP is required to form an S.

Also, a general question: I saw in #30 that you are using this for training: http://surdeanu.cs.arizona.edu//mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html Is it the same for the Swedish model and the models for other languages? For example, unlike the English model, the Swedish model has no FRAG among its labels. Is this because of the nature of the language itself, or did you use a different label set for each language?
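To check whether a category such as FRAG occurs anywhere in a model's labels, including inside compound labels, the vocabulary can be reduced to its distinct parts. This is a minimal sketch in plain Python: `atomic_categories` is a hypothetical helper, and `sample_vocab` holds only a few entries copied from the `label_vocab` output above. In practice the full `parser._parser.label_vocab` dictionary would be passed in.

```python
def atomic_categories(label_vocab):
    """Collect the distinct categories that appear in any '::'-joined label."""
    categories = set()
    for label in label_vocab:
        if label:  # skip the empty label
            categories.update(label.split("::"))
    return categories


# A few entries copied from the Swedish label_vocab printed above.
sample_vocab = {
    "": 0,
    "AP": 1,
    "NP::NP::S": 11,
    "PSEUDO": 23,
    "S::NP::NP": 27,
    "XP::S": 49,
}

cats = atomic_categories(sample_vocab)
print(sorted(cats))
print("FRAG" in cats)
```

Running the same helper on the vocabularies of two models and comparing the resulting sets would show which categories are specific to each model.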