jiesutd / NCRFpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS, segmentation). It includes character LSTM/CNN, word LSTM/CNN, and softmax/CRF components.
Apache License 2.0

I faced this error; maybe I should provide the word embedding when I want to train? #157

Closed myeghaneh closed 3 years ago

myeghaneh commented 4 years ago

I have run into a size-mismatch error. What I have done: I added my data in the desired format, as train01, dev01, and test01. Do you know what the problem is? (Details below.)

Maybe I should also add an embedding? If yes, can you give me a hint on how to do that?

Log file:


python main.py --config demo.decode.config

Remainder of file ignored
Seed num: 42
MODEL: decode
sample_data/raw01.bmes
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BIO
     Split         token:  |||
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Word  alphabet size: 4877
     Char  alphabet size: 77
     Label alphabet size: 5
     Word embedding  dir: sample_data/sample.word.emb
     Char embedding  dir: None
     Word embedding size: 50
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: sample_data/train01.bmes
     Dev    file directory: sample_data/dev01.bmes
     Test   file directory: sample_data/test01.bmes
     Raw    file directory: sample_data/raw01.bmes
     Dset   file directory: sample_data/lstmcrf.dset
     Model  file directory: sample_data/lstmcrf
     Loadmodel   directory: sample_data/lstmcrf.0.model
     Decode file directory: sample_data/raw.out
     Train instance number: 0
     Dev   instance number: 0
     Test  instance number: 0
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model        use_crf: True
     Model word extractor: LSTM
     Model       use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 1
     BatchSize: 10
     Average  batch   loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.015
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.5
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: False
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
nbest: 10
Load Model from file:  sample_data/lstmcrf
build sequence labeling network...
use_char:  True
char feature extractor:  CNN
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
build CRF...
Traceback (most recent call last):
  File "main.py", line 564, in <module>
    decode_results, pred_scores = load_model_decode(data, 'raw')
  File "main.py", line 490, in load_model_decode
    model.load_state_dict(torch.load(data.load_model_dir))
  File "C:\Users\moha\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SeqLabel:
        size mismatch for word_hidden.wordrep.char_feature.char_embeddings.weight: copying a param with shape torch.Size([71, 30]) from checkpoint, the shape in current model is torch.Size([77, 30]).
        size mismatch for word_hidden.wordrep.word_embedding.weight: copying a param with shape torch.Size([3115, 50]) from checkpoint, the shape in current model is torch.Size([4877, 50]).
        size mismatch for word_hidden.hidden2tag.weight: copying a param with shape torch.Size([20, 200]) from checkpoint, the shape in current model is torch.Size([7, 200]).
        size mismatch for word_hidden.hidden2tag.bias: copying a param with shape torch.Size([20]) from checkpoint, the shape in current model is torch.Size([7]).
        size mismatch for crf.transitions: copying a param with shape torch.Size([20, 20]) from checkpoint, the shape in current model is torch.Size([7, 7]).
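The size mismatches mean the checkpoint was trained with different alphabets (3115 words, 71 chars, 20 CRF states) than the ones just rebuilt (4877 words, 77 chars, 7 states), i.e. the saved model does not match the currently loaded data. A quick diagnostic sketch (not part of NCRF++; the checkpoint path is taken from the data summary above):

    import torch

    # Inspect the parameter shapes stored in the checkpoint and compare them
    # with the "current model is ..." shapes in the error message above.
    checkpoint = torch.load("sample_data/lstmcrf.0.model", map_location="cpu")
    for name, param in checkpoint.items():
        print(name, tuple(param.shape))

A word_embedding.weight of [3115, 50] in the checkpoint against [4877, 50] in the current model means the alphabets were rebuilt from different data than the model was trained on.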

(base) C:\ArgMin\NCRFpp-master01\NCRFpp-master>python main.py --config demo.train.config

Seed num: 42
MODEL: train
Load pretrained word embedding, norm: False, dir: sample_data/sample.word.emb
Embedding:
     pretrain word:15093, prefect match:2077, case_match:208, oov:2591, oov%:0.5312692228829198
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BIO
     Split         token:  |||
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Word  alphabet size: 4877
     Char  alphabet size: 77
     Label alphabet size: 5
     Word embedding  dir: sample_data/sample.word.emb
     Char embedding  dir: None
     Word embedding size: 50
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: sample_data/train01.bmes
     Dev    file directory: sample_data/dev01.bmes
     Test   file directory: sample_data/test01.bmes
     Raw    file directory: None
     Dset   file directory: None
     Model  file directory: sample_data/lstmcrf
     Loadmodel   directory: None
     Decode file directory: None
     Train instance number: 0
     Dev   instance number: 0
     Test  instance number: 0
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model        use_crf: True
     Model word extractor: LSTM
     Model       use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 1
     BatchSize: 10
     Average  batch   loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.015
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.5
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: False
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build sequence labeling network...
use_char:  True
char feature extractor:  CNN
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
build CRF...
Epoch: 0/1
 Learning rate is set as: 0.015
Traceback (most recent call last):
  File "main.py", line 554, in <module>
    train(data)
  File "main.py", line 394, in train
    print("Shuffle: first input word list:", data.train_Ids[0][0])
IndexError: list index out of range
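The IndexError comes from data.train_Ids being empty; note "Train instance number: 0" in the data summary above. NCRF++ treats a blank line as the sentence boundary, and with MAX SENTENCE LENGTH set to 250, a file with no blank lines is read as one overlong sentence and dropped. A minimal sketch of that reading logic (modelled on NCRF++'s read_instance; simplified, not the exact code):

    MAX_SENTENCE_LENGTH = 250  # matches the data summary above

    def count_instances(path):
        # Count blank-line-delimited sentences short enough to keep.
        instances, words = 0, []
        for line in open(path, encoding="utf-8", errors="ignore"):
            if line.strip():                      # token line: "word label"
                words.append(line.split()[0])
            elif words:                           # blank line ends a sentence
                if len(words) < MAX_SENTENCE_LENGTH:
                    instances += 1                # overlong sentences are dropped
                words = []
        if words and len(words) < MAX_SENTENCE_LENGTH:
            instances += 1                        # last sentence at end of file
        return instances

With no blank lines, the whole file is one "sentence" longer than 250 tokens, so nothing survives the length filter and the instance count is 0.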

Config file:

### use # to comment out the configure item

### I/O ###
train_dir=sample_data/train01.bmes
dev_dir=sample_data/dev01.bmes
test_dir=sample_data/test01.bmes
model_dir=sample_data/lstmcrf
word_emb_dir=sample_data/sample.word.emb

#raw_dir=
#decode_dir=
#dset_dir=
#load_model_dir=
#char_emb_dir=

norm_word_emb=False
norm_char_emb=False
number_normalized=True
seg=True
word_emb_dim=50
char_emb_dim=30

###NetworkConfiguration###
use_crf=True
use_char=True
word_seq_feature=LSTM
char_seq_feature=CNN
#feature=[POS] emb_size=20
#feature=[Cap] emb_size=20
#nbest=1

###TrainingSetting###
status=train
optimizer=SGD
iteration=1
batch_size=10
ave_batch_loss=False

###Hyperparameters###
cnn_layer=4
char_hidden_dim=50
hidden_dim=200
dropout=0.5
lstm_layer=1
bilstm=True
learning_rate=0.015
lr_decay=0.05
momentum=0
l2=1e-8
#gpu
#clip=
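The above is the train config; the failed decode run used demo.decode.config. A minimal decode config sketch, assembled from the paths shown in the decode log above (the shipped demo.decode.config may differ slightly):

    ### Decode ###
    status=decode
    raw_dir=sample_data/raw01.bmes
    nbest=10
    decode_dir=sample_data/raw.out
    dset_dir=sample_data/lstmcrf.dset
    load_model_dir=sample_data/lstmcrf.0.model

At decode time the alphabets must be restored from the .dset file saved during training; if they are rebuilt from other data, the embedding and CRF parameter shapes no longer match the checkpoint, which is exactly the size-mismatch error above.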

Sample data:

Yes B-P
, I-P
it I-P
's I-P
annoying I-P
and I-P
cumbersome I-P
to I-P
separate I-P
your I-P
rubbish I-P
properly I-P
all I-P
the I-P
time I-P
. I-P
Three B-P
different I-P
bin I-P
bags I-P
stink I-P
away I-P
in I-P
the I-P
kitchen I-P
and I-P
have I-P
to I-P
be I-P
sorted I-P
into I-P
different I-P
wheelie I-P
bins I-P
. I-P
But B-P
still I-P
Germany I-P
produces I-P
way I-P
too I-P
much I-P
rubbish I-P
and B-P
too I-P
many I-P
resources I-P
are I-P
lost I-P
when I-P
what I-P
actually I-P
should I-P
be I-P
separated I-P
and I-P
recycled I-P
is I-P
burnt I-P
. I-P
We B-C
Berliners

I have also added errors='ignore' to

 items = open(self.train_dir, 'r', errors='ignore').readline().strip('\n').split('\t')

in functions.py and data.py.

myeghaneh commented 4 years ago

It is actually solved! The problem was that there should be a blank line between sentences:

Yes B-P
, I-P
it I-P
's I-P
annoying I-P
and I-P
cumbersome I-P
to I-P
separate I-P
your I-P
rubbish I-P
properly I-P
all I-P
the I-P
time I-P
. I-P
                 <-- blank line between sentences (it would be good to mention this for users)
Three B-P
different I-P
bin I-P
bags I-P
stink I-P
away I-P
in I-P
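A quick way to catch this before training is to check each data file for blank lines between sentences. A small sanity-check sketch (check_blank_lines is a hypothetical helper, not part of NCRF++):

    def check_blank_lines(path):
        # Warn if a BIO file has no blank lines: NCRF++ needs one between sentences.
        lines = open(path, encoding="utf-8", errors="ignore").read().splitlines()
        if not any(not line.strip() for line in lines):
            print(f"{path}: no blank lines found; sentence boundaries are missing")

    for f in ["sample_data/train01.bmes", "sample_data/dev01.bmes", "sample_data/test01.bmes"]:
        check_blank_lines(f)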