jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

In the experiment on the MSRA dataset, where did I go wrong? #6

Closed Robets2020 closed 6 years ago

Robets2020 commented 6 years ago

In my experiment, the char embeddings and word embeddings are gigaword_chn.all.a2b.uni.ite50.vec and ctb.50d.vec respectively, while bichar_emb is set to None. The other parameters take the default values in the code. I have now run 80 epochs on an NVIDIA 1080 Ti GPU, but the test result on the MSRA test set does not reach the result in the paper; the best result is acc: 0.9891, p: 0.9331, r: 0.9093, f: 0.9210. Where did I go wrong?

In addition, if char embeddings trained on Chinese Wikipedia (larger than Gigaword; the embeddings contain 16115 words with 100 dimensions) are used instead of gigaword_chn.all.a2b.uni.ite50.vec (11327 words, 50 dimensions), the difference in test results between Bi-LSTM+CRF based on char + softword and LatticeLSTM (also using the same char embeddings trained on Chinese Wikipedia) is small. Is the large gap in the paper due to the use of weaker char embeddings? (The way I wired in the embeddings is sketched below.)
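For completeness, this is roughly how I pointed the code at the embeddings for these runs. It is only a sketch: the variable names are assumed to follow main.py, and the Wikipedia vectors path is a hypothetical placeholder for my own file.

```python
# Embedding paths for the runs described above (sketch; names assumed from main.py).
char_emb = "data/gigaword_chn.all.a2b.uni.ite50.vec"   # 11327 chars, 50d (paper setting)
# char_emb = "data/wiki_chn.char.100d.vec"             # 16115 chars, 100d (hypothetical path)
bichar_emb = None                                      # bigram char embeddings disabled
gaz_file = "data/ctb.50d.vec"                          # word/lexicon embeddings, 50d
```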

jiesutd commented 6 years ago

Can you show me your experiment log file? I can't tell what the problem with your experiment is without the details. But the results are reproducible: someone told me he can get 93.37 with the lattice LSTM structure.

Robets2020 commented 6 years ago

CuDNN: True
GPU available: True
Status: train
Seg: True
Train file: ../data_all/msra.train
Dev file: ../data_all/msra.test
Test file: ../data_all/msra.test
Raw file: None
Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: data/ctb.50d.vec
Model saved to: model_save_msra/model_saved
Load gaz file: data/ctb.50d.vec total size: 704368
gaz alphabet size: 108009
gaz alphabet size: 112072
gaz alphabet size: 112072
build word pretrain emb...
Embedding: pretrain word:11327, prefect match:4750, case_match:0, oov:70, oov%:0.0145198091682
build biword pretrain emb...
Embedding: pretrain word:0, prefect match:0, case_match:0, oov:315531, oov%:0.999996830749
build gaz pretrain emb...
Embedding: pretrain word:704368, prefect match:112070, case_match:0, oov:1, oov%:8.92283532015e-06
Training model...
DATA SUMMARY START:
Tag scheme: BIO
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: True
Use bigram: False
Word alphabet size: 4821
Biword alphabet size: 315532
Char alphabet size: 4821
Gaz alphabet size: 112072
Label alphabet size: 8
Word embedding size: 50
Biword embedding size: 50
Char embedding size: 30
Gaz embedding size: 50
Norm word emb: True
Norm biword emb: True
Norm gaz emb: True
Norm gaz dropout: 0.5
Train instance number: 46306
Dev instance number: 4361
Test instance number: 4361
Raw instance number: 0
Hyperpara iteration: 100
Hyperpara batch size: 1
Hyperpara lr: 0.015
Hyperpara lr_decay: 0.05
Hyperpara HP_clip: 5.0
Hyperpara momentum: 0
Hyperpara hidden_dim: 200
Hyperpara dropout: 0.5
Hyperpara lstm_layer: 1
Hyperpara bilstm: True
Hyperpara GPU: True
Hyperpara use_gaz: True
Hyperpara fix gaz emb: False
Hyperpara use_char: False
DATA SUMMARY END.
Data setting saved to file: model_save_msra/model_saved.dset
build batched lstmcrf...
build batched bilstm...
build LatticeLSTM... forward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (112072, 50)
build LatticeLSTM... backward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (112072, 50)
build batched crf...
finished built model.
Epoch: 0/100
Learning rate is setted as: 0.015
Instance: 500; Time: 135.77s; loss: 7252.2418; acc: 20057.0/23097.0=0.8684
Instance: 1000; Time: 137.81s; loss: 4270.2162; acc: 41414.0/46921.0=0.8826
Instance: 1500; Time: 138.46s; loss: 3263.4362; acc: 62703.0/70403.0=0.8906
Instance: 2000; Time: 135.55s; loss: 2883.2704; acc: 83530.0/93137.0=0.8969
Instance: 2500; Time: 131.57s; loss: 2852.4989; acc: 104249.0/115773.0=0.9005
Instance: 3000; Time: 127.30s; loss: 2122.8196; acc: 124314.0/137332.0=0.9052
Instance: 3500; Time: 130.17s; loss: 2293.1783; acc: 145697.0/160342.0=0.9087
Instance: 4000; Time: 130.64s; loss: 2033.5865; acc: 166517.0/182741.0=0.9112
Instance: 4500; Time: 139.14s; loss: 2385.4420; acc: 188535.0/206620.0=0.9125
Instance: 5000; Time: 137.74s; loss: 1942.3583; acc: 210202.0/229693.0=0.9151
Instance: 5500; Time: 138.85s; loss: 1760.2437; acc: 231750.0/252636.0=0.9173
Instance: 6000; Time: 142.32s; loss: 2034.1484; acc: 253825.0/276302.0=0.9187
Instance: 6500; Time: 135.29s; loss: 1830.8003; acc: 275090.0/298984.0=0.9201
Instance: 7000; Time: 143.21s; loss: 2331.3439; acc: 296953.0/322608.0=0.9205
Instance: 7500; Time: 140.71s; loss: 1849.1990; acc: 318623.0/345704.0=0.9217
Instance: 8000; Time: 144.95s; loss: 1766.9981; acc: 341329.0/369781.0=0.9231
Instance: 8500; Time: 142.80s; loss: 1715.5968; acc: 363510.0/393309.0=0.9242
Instance: 9000; Time: 143.70s; loss: 1498.4651; acc: 385363.0/416395.0=0.9255
Instance: 9500; Time: 140.18s; loss: 1495.8922; acc: 407107.0/439293.0=0.9267

jiesutd commented 6 years ago

Hi @Robert201806 , I compared your log with mine. The difference is that the gaz embeddings should not be normalized: my log shows Norm gaz emb: False, while yours shows True. This affects system performance a lot. (See the sketch below for where this switch is set.)
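A minimal sketch of where the switch lives; the attribute names are assumed to match the lines printed in the DATA SUMMARY, and it has to be set before the embeddings are built:

```python
# Sketch only: attribute names assumed from the "DATA SUMMARY" printout.
from utils.data import Data

data = Data()
data.norm_word_emb = True     # "Norm word emb: True"
data.norm_biword_emb = True   # "Norm biword emb: True"
data.norm_gaz_emb = False     # "Norm gaz emb: False" -- the setting that differs in your run
```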

jiesutd commented 6 years ago

Here is my log information for your reference:

CuDNN: True
GPU available: True
Status: train
Seg: True
Train file: ../data/msra_cn_ner_char/train_dev.bmes
Dev file: ../data/msra_cn_ner_char/test.bmes
Test file: ../data/msra_cn_ner_char/test.bmes
Raw file: None
Char emb: ../data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: ../data/ctb.50d.vec
Model saved to: ../data/msra_cn_ner_char/gaz.tune.drop0.5
Load gaz file: ../data/ctb.50d.vec total size: 704368
gaz alphabet size: 108008
gaz alphabet size: 112071
gaz alphabet size: 112071
build word pretrain emb...
Embedding: pretrain word:11327, prefect match:4751, case_match:0, oov:70, oov%:0.0145167980091
build biword pretrain emb...
Embedding: pretrain word:0, prefect match:0, case_match:0, oov:315530, oov%:0.999996830739
build gaz pretrain emb...
Embedding: pretrain word:704368, prefect match:112070, case_match:0, oov:0, oov%:0.0
Training model...
DATA SUMMARY START:
Tag scheme: BMES
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: True
Use bigram: False
Word alphabet size: 4822
Biword alphabet size: 315531
Char alphabet size: 4823
Gaz alphabet size: 112071
Label alphabet size: 14
Word embedding size: 50
Biword embedding size: 50
Char embedding size: 30
Gaz embedding size: 50
Norm word emb: True
Norm biword emb: True
Norm gaz emb: False
Train instance number: 46306
Dev instance number: 4361
Test instance number: 4361
Raw instance number: 0
Hyperpara iteration: 100
Hyperpara batch size: 1
Hyperpara lr: 0.015
Hyperpara lr_decay: 0.05
Hyperpara HP_clip: 5.0
Hyperpara momentum: 0
Hyperpara hidden_dim: 200
Hyperpara dropout: 0.5
Hyperpara lstm_layer: 1
Hyperpara bilstm: True
Hyperpara GPU: True
Hyperpara use_gaz: True
Hyperpara fix gaz emb: False
Hyperpara use_char: False
DATA SUMMARY END.
build batched lstmcrf...
build batched bilstm...
build SkipLSTM... forward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (112071, 50)
build SkipLSTM... backward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (112071, 50)
build batched crf...
finished built model.
Epoch: 0/100
Learning rate is setted as: 0.015
Instance: 500; Time: 113.48s; loss: 8335.7647; acc: 20163.0/23097.0=0.8730
Instance: 1000; Time: 117.17s; loss: 4413.0082; acc: 41565.0/46921.0=0.8859
Instance: 1500; Time: 116.07s; loss: 3119.5857; acc: 62997.0/70403.0=0.8948
Instance: 2000; Time: 115.86s; loss: 2626.0455; acc: 84023.0/93137.0=0.9021
Instance: 2500; Time: 111.80s; loss: 2544.3521; acc: 104846.0/115773.0=0.9056
Instance: 3000; Time: 106.09s; loss: 1940.4282; acc: 125105.0/137332.0=0.9110
Instance: 3500; Time: 113.18s; loss: 2138.9087; acc: 146553.0/160342.0=0.9140
Instance: 4000; Time: 110.27s; loss: 1799.1360; acc: 167592.0/182741.0=0.9171
Instance: 4500; Time: 117.53s; loss: 2164.7759; acc: 189836.0/206620.0=0.9188

Robets2020 commented 6 years ago

Thank you. Was the BMES tag scheme used in your experiment?

jiesutd commented 6 years ago

Yes, it is essentially the BIOES tag scheme. Generally, BIOES gives better results than BIO (at least on the CoNLL 2003 English dataset).
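For a quick illustration (my own toy example, not taken from the dataset): a three-character location tagged B-LOC I-LOC I-LOC under BIO becomes B-LOC I-LOC E-LOC under BIOES (B-LOC M-LOC E-LOC in the BMES naming used here), with the S- prefix reserved for single-character entities, so the CRF sees entity boundaries explicitly.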

jiesutd commented 6 years ago

I have written a tag scheme converter script here: https://github.com/jiesutd/NCRFpp/blob/master/utils/tagSchemeConverter.py . I hope it works for Chinese NER; you can give it a try if necessary.
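The rule it applies is simple; here is a rough sketch of the conversion logic (an illustration only, not the linked script itself):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES (use B/M/E/S if you prefer BMES naming).

    Sketch only: the last token of every entity becomes E-, and single-token
    entities become S-.
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == "O":
            new_tags.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = next_tag == "I-" + label
        if prefix == "B":
            new_tags.append(("B-" if entity_continues else "S-") + label)
        else:  # prefix == "I"
            new_tags.append(("I-" if entity_continues else "E-") + label)
    return new_tags

# Example: a three-token LOC entity followed by O
print(bio_to_bioes(["B-LOC", "I-LOC", "I-LOC", "O"]))
# ['B-LOC', 'I-LOC', 'E-LOC', 'O']
```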

Robets2020 commented 6 years ago

Could you paste more of the log? I am worried that the training process may go wrong later on.

jiesutd commented 6 years ago

Here is the log: msra.unigram.gaz.tune.drop0.5.log. It gives a best result of 93.18 within 30 epochs, and it seems the final result will improve further with more iterations.

helloword12345678 commented 6 years ago

Hi, the file gigaword_chn.all.a2b.uni.ite50.vec is the char embedding file, and its content looks like this:

中 0.218149 1.403142 3.397185 -6.166873 -3.565661 -2.417709 -4.509896 1.463636 -0.550171 3.181630 5.352833 10.920075 -0.016650 -1.790418 5.159950 -0.205418 4.013869 1.796362 0.355024 2.764315 1.240198 2.466680 10.205541 -2.583954 -2.133391 2.468411 -2.563997 -5.540087 2.686093 -2.683196 -0.713770 -0.745902 1.646485 -1.759226 3.043046 -1.753233 -7.549716 -0.422689 4.617973 2.930810 -0.560774 2.921355 1.878473 0.757087 -1.174531 -1.169185 0.037929 -0.125337 0.889473 3.427402

We can see that the dimension is 50, so why does the debug console log show Char embedding size: 30?
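A quick way to double-check the file itself (a small sketch, assuming the whitespace-separated word2vec text format shown above):

```python
# Print the token and vector dimension of the first embedding entry in the file.
with open("data/gigaword_chn.all.a2b.uni.ite50.vec", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split()
        if len(parts) <= 2:          # skip a possible "vocab_size dim" header line
            continue
        print(parts[0], len(parts) - 1)   # e.g. 中 50
        break
```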

helloword12345678 commented 6 years ago

I am not familiar with PyTorch. In the file charcnn.py I see:

self.char_embeddings = nn.Embedding(alphabet_size, embedding_dim)
self.char_embeddings.weight.data.copy_(torch.from_numpy(self.random_embedding(alphabet_size, embedding_dim)))

The weights are initialized randomly. Does this mean the pretrained char embedding file is not used any more? Thanks.

jiesutd commented 6 years ago

Hello @helloword12345678 , LatticeLSTM is based on an early version of NCRF++, where the inputs are separated into character and word levels. However, in this code Chinese characters are treated as the "words" of NCRF++, and the NCRF++ "character" level is not used. I should have made those print statements clearer in this code.

So what you see about the character level (its dimension and pretraining) in this code is not related to our model, since it is not used.
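For completeness, the pretrained unigram vectors do get loaded, just into the word-level embedding table (where each "word" is a Chinese character). The usual pattern is to build the layer with a random initialization and then overwrite it in place, roughly like this (a general PyTorch sketch with assumed names and shapes, not the repo's exact code):

```python
import numpy as np
import torch
import torch.nn as nn

# Assumed shapes taken from the log above: ~4822 "word" (character) types, 50-dim vectors.
alphabet_size, embedding_dim = 4822, 50

# 1) Random initialization (what the question points at).
word_embeddings = nn.Embedding(alphabet_size, embedding_dim)

# 2) When a pretrained .vec file is given, its vectors are copied over the random
#    weights in place; here a random matrix stands in for the vectors parsed from
#    gigaword_chn.all.a2b.uni.ite50.vec.
pretrain_word_embedding = np.random.uniform(-0.1, 0.1, (alphabet_size, embedding_dim)).astype("float32")
word_embeddings.weight.data.copy_(torch.from_numpy(pretrain_word_embedding))
```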