jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

The Weibo dataset does not reach the accuracy reported in the paper #29

Closed ljch2018 closed 6 years ago

ljch2018 commented 6 years ago
  1. I tried to reproduce the overall result on the Weibo dataset reported in the paper, but the F1 on the test set only reaches 54 while the paper reports 58, so it falls short of the paper's accuracy;
  2. I downloaded the Weibo dataset from https://github.com/hltcoe/golden-horse and used the data/weiboNER_2nd_conll.* files as the dataset, with the BIO tagging scheme; since I wanted to reproduce the overall result, I did not modify the data and used all of it directly.
  3. My command is as follows:
    python main.py --status train \
                --train ./Weibo/weiboNER_2nd_conll.train.bio \
                --dev ./Weibo/weiboNER_2nd_conll.dev.bio \
                --test ./Weibo/weiboNER_2nd_conll.test.bio \
                --savemodel ./Weibo/model
  4. Here is the relevant log output:
    Train file: ./Weibo/weiboNER_2nd_conll.train.bio
    Dev file: ./Weibo/weiboNER_2nd_conll.dev.bio
    Test file: ./Weibo/weiboNER_2nd_conll.test.bio
    Raw file: None
    Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
    Bichar emb: None
    Gaz file: data/ctb.50d.vec
    Model saved to: ./Weibo/model
    Load gaz file:  data/ctb.50d.vec  total size: 704368
    gaz alphabet size: 10798
    gaz alphabet size: 12235
    gaz alphabet size: 13671
    build word pretrain emb...
    Embedding:
     pretrain word:11327, prefect match:3281, case_match:0, oov:75, oov%:0.0223413762288
    build biword pretrain emb...
    Embedding:
     pretrain word:0, prefect match:0, case_match:0, oov:42646, oov%:0.999976551692
    build gaz pretrain emb...
    Embedding:
     pretrain word:704368, prefect match:13669, case_match:0, oov:1, oov%:7.31475385853e-05
    Training model...
    DATA SUMMARY START:
     Tag          scheme: BIO
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Use          bigram: False
     Word  alphabet size: 3357
     Biword alphabet size: 42647
     Char  alphabet size: 3357
     Gaz   alphabet size: 13671
     Label alphabet size: 18
     Word embedding size: 50
     Biword embedding size: 50
     Char embedding size: 30
     Gaz embedding size: 50
     Norm     word   emb: True
     Norm     biword emb: True
     Norm     gaz    emb: False
     Norm   gaz  dropout: 0.5
     Train instance number: 1350
     Dev   instance number: 270
     Test  instance number: 270
     Raw   instance number: 0
     Hyperpara  iteration: 100
     Hyperpara  batch size: 1
     Hyperpara          lr: 0.015
     Hyperpara    lr_decay: 0.05
     Hyperpara     HP_clip: 5.0
     Hyperpara    momentum: 0
     Hyperpara  hidden_dim: 200
     Hyperpara     dropout: 0.5
     Hyperpara  lstm_layer: 1
     Hyperpara      bilstm: True
     Hyperpara         GPU: True
     Hyperpara     use_gaz: True
     Hyperpara fix gaz emb: False
     Hyperpara    use_char: False
    DATA SUMMARY END.
    Data setting saved to file:  ./Weibo/model.dset

@jiesutd Could you tell me where the problem might be? Thanks a lot!

jiesutd commented 6 years ago

You can convert the data to BIOES format first; the results reported in my paper were all obtained with BIOES-format input.
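
A minimal sketch of such a conversion, assuming well-formed BIO input (the function name bio_to_bioes and the script itself are hypothetical illustrations, not part of this repo, which may provide its own utility):

    def bio_to_bioes(tags):
        """Convert one sentence's BIO tag list to BIOES (assumes well-formed BIO)."""
        bioes = []
        for i, tag in enumerate(tags):
            if tag == "O":
                bioes.append(tag)
                continue
            prefix, label = tag.split("-", 1)
            nxt = tags[i + 1] if i + 1 < len(tags) else "O"
            if prefix == "B":
                # a single-token entity: B- becomes S-
                bioes.append(("B-" if nxt == "I-" + label else "S-") + label)
            else:  # prefix == "I"
                # the last token of an entity: I- becomes E-
                bioes.append(("I-" if nxt == "I-" + label else "E-") + label)
        return bioes

    # Example: ["B-PER.NAM", "I-PER.NAM", "O", "B-LOC.NAM"]
    #       -> ["B-PER.NAM", "E-PER.NAM", "O", "S-LOC.NAM"]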

ljch2018 commented 6 years ago

@jiesutd Thanks for the advice, I will give it a try first.

HaimianYu commented 6 years ago

In this dataset, is the suffix after each character a position index? (e.g. 赵0 B-PER.NAM) That position index should be removed, right? Otherwise the characters will not match our pretrained character embeddings?

jiesutd commented 6 years ago

@HaimianYu Yes, the position information needs to be stripped out.
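
A minimal sketch of stripping those position digits (a hypothetical helper, not part of this repo; it assumes each token in the golden-horse weiboNER_2nd_conll files carries a single trailing position digit):

    def strip_position(in_path, out_path):
        """Turn lines like '赵0 B-PER.NAM' into '赵 B-PER.NAM'."""
        with open(in_path, encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.rstrip("\n")
                if not line.strip():
                    fout.write("\n")           # keep blank sentence separators
                    continue
                token, tag = line.split()      # e.g. "赵0", "B-PER.NAM"
                token = token[:-1]             # drop the appended position digit
                fout.write(token + " " + tag + "\n")

    # strip_position("weiboNER_2nd_conll.train.bio", "weiboNER_2nd_conll.train.clean")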