nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

How to train on gold tags dataset #21

Open chiehminwei opened 5 years ago

chiehminwei commented 5 years ago

I have a copy of the revised PennTreebank that looks like the format of the files in data/. However, the code breaks when I try to use these files. On further inspection, I'm guessing I need to insert a "TOP" tag at the start of every sentence? I did that and the model starts training, but then the EVAL script doesn't work. My copy of the treebank is somehow also missing a sentence. Is this what's causing the problem for the EVAL script? Can I just copy and paste the sentence that's missing from the silver trees you provided?
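For illustration only, the "insert a TOP tag" step guessed at above could be sketched as follows. This assumes one bracketed tree per line with no existing root wrapper; the helper name `wrap_in_top` is hypothetical and is not part of the parser's code base.

```python
def wrap_in_top(line: str, root: str = "TOP") -> str:
    """Wrap a one-line bracketed tree in a (TOP ...) node unless it already has one."""
    tree = line.strip()
    # Leave blank lines and already-wrapped trees unchanged.
    if not tree or tree.startswith(f"({root} "):
        return tree
    return f"({root} {tree})"


# Hypothetical input: the first tree lacks a root wrapper, the second already has one.
lines = [
    "(S (NP (PRP I)) (VP (VBP agree)) (. .))",
    "(TOP (S (NP (PRP I)) (VP (VBP agree)) (. .)))",
]
wrapped = [wrap_in_top(line) for line in lines]
```

Both entries of `wrapped` end up with exactly one `TOP` node. Whether the wrapper is actually what the training code expects is the open question in this comment, so treat this as a sketch of the workaround rather than a confirmed requirement.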

chiehminwei commented 5 years ago

This is the error I got from the EVAL script. I'm guessing my data is corrupted. Where do you obtain the data? Do you use the revised PennTreebank, or do you use some script to convert from PennTreebank 3.0?

2 : Length unmatch (43|42)
4 : Words unmatch (self|`)
8 : Length unmatch (24|25)
17 : Words unmatch (-LRB-|D.)
53 : Length unmatch (35|34)
58 : Length unmatch (36|34)
76 : Length unmatch (30|31)
80 : Length unmatch (52|49)
82 : Length unmatch (54|53)
86 : Length unmatch (27|26)
97 : Length unmatch (31|29)
99 : Length unmatch (46|44)
104 : Length unmatch (28|27)
107 : Length unmatch (51|49)
110 : Length unmatch (31|29)
132 : Length unmatch (15|16)
171 : Length unmatch (19|20)
172 : Length unmatch (14|16)
177 : Length unmatch (24|22)
204 : Length unmatch (12|13)
208 : Length unmatch (51|49)
216 : Length unmatch (37|36)
219 : Length unmatch (27|28)
244 : Length unmatch (17|15)
287 : Length unmatch (26|24)
317 : Length unmatch (14|15)
326 : Length unmatch (32|31)
339 : Length unmatch (38|36)
361 : Length unmatch (55|56)
370 : Length unmatch (38|39)
423 : Length unmatch (30|31)
424 : Length unmatch (18|20)
431 : Length unmatch (40|39)
435 : Length unmatch (31|29)
462 : Length unmatch (30|29)
466 : Length unmatch (39|37)
470 : Length unmatch (16|15)
471 : Length unmatch (23|22)
475 : Length unmatch (37|36)
476 : Length unmatch (29|30)
479 : Length unmatch (21|22)
484 : Length unmatch (20|18)
488 : Length unmatch (9|8)
489 : Length unmatch (18|17)
491 : Length unmatch (30|31)
511 : Length unmatch (33|34)
515 : Length unmatch (21|22)
525 : Length unmatch (28|29)
534 : Length unmatch (24|25)
546 : Length unmatch (11|12)
585 : Length unmatch (24|22)
586 : Length unmatch (42|43)
595 : Length unmatch (21|20)

nikitakit commented 5 years ago

I posted treebank conversion scripts at https://github.com/nikitakit/parser-data-gen

These scripts are able to recover the gold-tag data format I have directly from the LDC release.

When it comes to EVALB errors, it's actually normal to see some during the first 1-2 epochs of training (and especially the first time a model is evaluated on the dev set). In fact, for a randomly-initialized parser, EVALB might just crash instead of returning a very low accuracy. The errors should go away once the parser has been trained for a few epochs and starts producing reasonable, non-random outputs.

The "length unmatch" message can occur when predicted punctuation tags differ from the gold tags, because punctuation is excluded from length calculation in the standard evaluation. The "words unmatch" error, on the other hand, looks like a potential data processing issue.
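To tell the two cases apart outside of EVALB, a rough diagnostic along these lines may help. It is a minimal sketch, not EVALB's actual logic: the `DELETED_TAGS` set is an assumption modeled on the kind of punctuation labels an EVALB parameter file typically excludes, and the function names are made up for this example.

```python
import re

# Assumption: tags excluded from the length count. The real list comes from the
# EVALB parameter file you evaluate with, so adjust this set to match it.
DELETED_TAGS = {"-NONE-", ",", ":", "``", "''", "."}

# Matches a preterminal such as "(NN dog)" and captures (tag, word).
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaves(tree: str) -> list[tuple[str, str]]:
    """Return the (tag, word) pairs of a one-line bracketed tree, left to right."""
    return LEAF.findall(tree)


def compare(gold_tree: str, pred_tree: str) -> str:
    """Classify a gold/predicted pair as 'ok', a word mismatch, or a length mismatch."""
    gold, pred = leaves(gold_tree), leaves(pred_tree)
    # Different word sequences point at a data or tokenization problem.
    if [word for _, word in gold] != [word for _, word in pred]:
        return "words differ"
    # Same words but different counts after dropping punctuation tags
    # means the two sides disagree on which tokens are punctuation.
    gold_len = sum(tag not in DELETED_TAGS for tag, _ in gold)
    pred_len = sum(tag not in DELETED_TAGS for tag, _ in pred)
    if gold_len != pred_len:
        return f"length differs: gold={gold_len} pred={pred_len}"
    return "ok"


gold = "(TOP (S (NP (PRP I)) (VP (VBP agree)) (. .)))"
mistagged = "(TOP (S (NP (PRP I)) (VP (VBP agree) (NN .))))"
result_same = compare(gold, gold)
result_length = compare(gold, mistagged)
result_words = compare("(TOP (NP (NN self)))", "(TOP (NP (NN `)))")
```

In the second pair the final period is tagged `NN` in the prediction, so it is no longer dropped and the counts disagree even though the words are identical. A "words differ" result, by contrast, cannot be explained by tagging and suggests the gold and predicted files were built from differently tokenized text.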

chiehminwei commented 5 years ago

Thank you so much for the scripts! They're really helpful. I've successfully converted PTB 3.0 and EVALB is looking good. Did you use the same scripts to convert the Chinese Treebank? Where should I place the files for Chinese?

nikitakit commented 5 years ago

I added a CTB processing script as well: https://github.com/nikitakit/parser-data-gen/blob/master/corpora/ctb_5.1/build_corpus.sh

You'll have to change the reference to ${HOME}/data/ctb_5.1/ to instead point to the right location on your machine.