jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

The Weibo dataset does not reach the accuracy reported in the paper #29

Closed ljch2018 closed 6 years ago

ljch2018 commented 6 years ago
  1. I tried to reproduce the overall result on the Weibo dataset reported in the paper, but the F1 on the test set only reaches 54 while the paper reports 58, so it falls short of the paper's accuracy;
  2. I downloaded the Weibo dataset from https://github.com/hltcoe/golden-horse and used the data/weiboNER_2nd_conll.* files as the dataset, with the BIO tagging scheme; since I wanted to reproduce the overall result, I did not modify the data and used all of it directly.
  3. My command is as follows:
    python main.py --status train \
                --train ./Weibo/weiboNER_2nd_conll.train.bio \
                --dev ./Weibo/weiboNER_2nd_conll.dev.bio \
                --test ./Weibo/weiboNER_2nd_conll.test.bio \
                --savemodel ./Weibo/model
  4. Here is the relevant log output:
    Train file: ./Weibo/weiboNER_2nd_conll.train.bio
    Dev file: ./Weibo/weiboNER_2nd_conll.dev.bio
    Test file: ./Weibo/weiboNER_2nd_conll.test.bio
    Raw file: None
    Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
    Bichar emb: None
    Gaz file: data/ctb.50d.vec
    Model saved to: ./Weibo/model
    Load gaz file:  data/ctb.50d.vec  total size: 704368
    gaz alphabet size: 10798
    gaz alphabet size: 12235
    gaz alphabet size: 13671
    build word pretrain emb...
    Embedding:
     pretrain word:11327, prefect match:3281, case_match:0, oov:75, oov%:0.0223413762288
    build biword pretrain emb...
    Embedding:
     pretrain word:0, prefect match:0, case_match:0, oov:42646, oov%:0.999976551692
    build gaz pretrain emb...
    Embedding:
     pretrain word:704368, prefect match:13669, case_match:0, oov:1, oov%:7.31475385853e-05
    Training model...
    DATA SUMMARY START:
     Tag          scheme: BIO
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Use          bigram: False
     Word  alphabet size: 3357
     Biword alphabet size: 42647
     Char  alphabet size: 3357
     Gaz   alphabet size: 13671
     Label alphabet size: 18
     Word embedding size: 50
     Biword embedding size: 50
     Char embedding size: 30
     Gaz embedding size: 50
     Norm     word   emb: True
     Norm     biword emb: True
     Norm     gaz    emb: False
     Norm   gaz  dropout: 0.5
     Train instance number: 1350
     Dev   instance number: 270
     Test  instance number: 270
     Raw   instance number: 0
     Hyperpara  iteration: 100
     Hyperpara  batch size: 1
     Hyperpara          lr: 0.015
     Hyperpara    lr_decay: 0.05
     Hyperpara     HP_clip: 5.0
     Hyperpara    momentum: 0
     Hyperpara  hidden_dim: 200
     Hyperpara     dropout: 0.5
     Hyperpara  lstm_layer: 1
     Hyperpara      bilstm: True
     Hyperpara         GPU: True
     Hyperpara     use_gaz: True
     Hyperpara fix gaz emb: False
     Hyperpara    use_char: False
    DATA SUMMARY END.
    Data setting saved to file:  ./Weibo/model.dset

@jiesutd Could you tell me where the problem might be? Thanks a lot!

jiesutd commented 6 years ago

You can convert the data to BIOES format first; the results reported in my paper were all obtained with BIOES-format input.
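
A minimal sketch of such a conversion, assuming well-formed BIO input (the function name bio_to_bioes and the script itself are hypothetical illustrations, not part of this repo, which may provide its own utility):

    def bio_to_bioes(tags):
        """Convert one sentence's BIO tag list to BIOES (assumes well-formed BIO)."""
        bioes = []
        for i, tag in enumerate(tags):
            if tag == "O":
                bioes.append(tag)
                continue
            prefix, label = tag.split("-", 1)
            nxt = tags[i + 1] if i + 1 < len(tags) else "O"
            if prefix == "B":
                # a single-token entity: B- becomes S-
                bioes.append(("B-" if nxt == "I-" + label else "S-") + label)
            else:  # prefix == "I"
                # the last token of an entity: I- becomes E-
                bioes.append(("I-" if nxt == "I-" + label else "E-") + label)
        return bioes

    # Example: ["B-PER.NAM", "I-PER.NAM", "O", "B-LOC.NAM"]
    #       -> ["B-PER.NAM", "E-PER.NAM", "O", "S-LOC.NAM"]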

ljch2018 commented 6 years ago

@jiesutd Thanks for the advice, I will give it a try first.

HaimianYu commented 6 years ago

In this dataset, is the suffix after each character a position index? (e.g. 赵0 B-PER.NAM) That position index should be removed, right? Otherwise the characters will not match our pretrained character embeddings?

jiesutd commented 6 years ago

@HaimianYu Yes, the position information needs to be stripped out.
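
A minimal sketch of stripping those position digits (a hypothetical helper, not part of this repo; it assumes each token in the golden-horse weiboNER_2nd_conll files carries a single trailing position digit):

    def strip_position(in_path, out_path):
        """Turn lines like '赵0 B-PER.NAM' into '赵 B-PER.NAM'."""
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.rstrip("\n")
                if not line.strip():
                    fout.write("\n")           # keep blank sentence separators
                    continue
                token, tag = line.split()      # e.g. "赵0", "B-PER.NAM"
                token = token[:-1]             # drop the appended position digit
                fout.write(token + " " + tag + "\n")

    # strip_position("weiboNER_2nd_conll.train.bio", "weiboNER_2nd_conll.train.clean")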