Problem in reproduction (CoNLL)

danny911kr commented 5 years ago

Hi. First of all, thank you for your great project!

I just ran your code on the CoNLL dataset. In the closed issues, you said that we only need to change the iteration count (1 -> 100) in your demo.train.config file. However, your COLING 2018 paper states that 100-d GloVe word embeddings were used for CCNN+WLSTM+CRF. Is it right that we also have to change your demo config file as below?

word_emb_dir=glove
word_emb_dim=100

Also, I get a 90.09 F1 score on CoNLL 2003 NER with the 100-d GloVe embeddings. Could you tell me what the "minimum" score for CCNN+WLSTM+CRF was?

Thanks!

jiesutd commented 5 years ago

Of course, you need to change the embedding dimension as well as the pretrained word embeddings dir in the configuration file. In my experience, the lowest score would be around 90.9-91.0 and the highest around 91.3-91.4, depending on the random seed.

As you only got 90.09, you may want to check whether you used the right tag scheme (BIOES > BIO). Or you can share your running log with me, and I can give you more specific suggestions.

danny911kr commented 5 years ago

Here is the decoding log. I'm currently trying different random seeds, but my best so far is 90.26. Is there anything I have to change?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Seed num: 42
MODEL: decode
data/conllbio.test
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start Sequence Laebling task...
     Tag scheme: BIO
     Split token: |||
     MAX SENTENCE LENGTH: 250
     MAX WORD LENGTH: -1
     Number normalized: True
     Word alphabet size: 25306
     Char alphabet size: 78
     Label alphabet size: 10
     Word embedding dir: ./glove.6B.100d.txt
     Char embedding dir: None
     Word embedding size: 100
     Char embedding size: 30
     Norm word emb: False
     Norm char emb: False
     Train file directory: data/conllbio.train
     Dev file directory: data/conllbio.devel
     Test file directory: data/conllbio.test
     Raw file directory: data/conllbio.test
     Dset file directory: conll/conll-2best.dset
     Model file directory: conll/conll-model2
     Loadmodel directory: conll/conll-2best.model
     Decode file directory: conll/conll-2best.test.out
     Train instance number: 14986
     Dev instance number: 3465
     Test instance number: 3683
     Raw instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model use_crf: True
     Model word extractor: LSTM
     Model use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 100
     BatchSize: 10
     Average batch loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper lr: 0.015
     Hyper lr_decay: 0.05
     Hyper HP_clip: None
     Hyper momentum: 0.0
     Hyper l2: 1e-08
     Hyper hidden_dim: 200
     Hyper dropout: 0.5
     Hyper lstm_layer: 1
     Hyper bilstm: True
     Hyper GPU: True
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
nbest: 1
Load Model from file: conll/conll-model2
build sequence labeling network...
use_char: True
char feature extractor: CNN
word feature extractor: LSTM
use crf: True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
build CRF...
Decode raw data, nbest: 1 ...
Right token = 45739 All token = 46665 acc = 0.9801564341583628
raw: time:15.72s, speed:235.73st/s; acc: 0.9802, p: 0.9007, r: 0.9010, f: 0.9009
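
For anyone reading the last two lines of this log, here is a small sanity check of how the reported numbers relate to each other. It is only a sketch: it assumes the usual definitions (token accuracy = right tokens / all tokens, F1 = harmonic mean of precision and recall), and the variable names are mine, not NCRF++'s. Because p and r are already rounded in the log, the recomputed F1 may differ from the printed value in the last digit.

```python
# Values copied from the final lines of the decoding log above.
right_tokens = 45739
all_tokens = 46665
precision = 0.9007  # rounded in the log
recall = 0.9010     # rounded in the log

# Token-level accuracy: fraction of tokens whose predicted tag is correct.
accuracy = right_tokens / all_tokens

# Entity-level F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"acc = {accuracy:.4f}")
print(f"f1  = {f1:.4f}")
```

The gap between the high token accuracy and the lower F1 is expected, since F1 is computed over whole entities rather than individual tokens.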

jiesutd commented 5 years ago

As I said, please change the tag scheme from BIO to BIOES, which will give better performance. Your log shows "Tag scheme: BIO".

jiesutd commented 5 years ago

You can use my script below to convert the tag scheme between BIO and BIOES.

https://github.com/jiesutd/NCRFpp/blob/master/utils/tagSchemeConverter.py
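
To illustrate what the conversion does, here is a minimal sketch of BIO -> BIOES for a single sentence. This is not the linked script: the function name `bio_to_bioes` and the example tags are hypothetical, and it assumes well-formed BIO input where every entity starts with a B- tag. In BIOES, a single-token entity becomes S- and the last token of a multi-token entity becomes E-.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert one sentence of BIO tags to BIOES (hypothetical helper)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # B- stays B- if the entity continues, else it is a single-token S-.
            bioes.append(("B-" if continues else "S-") + label)
        else:
            # I- stays I- if the entity continues, else it is the ending E-.
            bioes.append(("I-" if continues else "E-") + label)
    return bioes


example = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
print(bio_to_bioes(example))
# -> ['B-PER', 'E-PER', 'O', 'S-LOC', 'O', 'B-ORG', 'I-ORG', 'E-ORG']
```

For real data, the train, dev and test files all need to be converted so that they use the same scheme.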

danny911kr commented 5 years ago

Thanks!! I could get results!! Thank you for your kind reply!!

jiesutd commented 5 years ago

Congratulations and thank you for sharing the results.

jiesutd / NCRFpp

Problem in reproduction (CoNLL) #93