jiesutd / NCRFpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Apache License 2.0

list index out of range #146

Closed Rajashan closed 4 years ago

Rajashan commented 4 years ago

I am trying to run the main script, but get the following error.

Traceback (most recent call last):
  File "main.py", line 554, in <module>
    train(data)
  File "main.py", line 394, in train
    print("Shuffle: first input word list:", data.train_Ids[0][0])
IndexError: list index out of range

The data summary says that there are no data instances.

Seed num: 42
MODEL: train
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BIO
     Split         token:  |||
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Word  alphabet size: 54397
     Char  alphabet size: 93
     Label alphabet size: 4
     Word embedding  dir: None
     Char embedding  dir: None
     Word embedding size: 50
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: sample_data/train_wout.bmes
     Dev    file directory: sample_data/valid_wout.bmes
     Test   file directory: sample_data/test_wout.bmes
     Raw    file directory: None
     Dset   file directory:
     Model  file directory:
     Loadmodel   directory: None
     Decode file directory: None
     Train instance number: 0
     Dev   instance number: 0
     Test  instance number: 0
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model        use_crf: True
     Model word extractor: LSTM
     Model       use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 1
     BatchSize: 10
     Average  batch   loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.015
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.5
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: False
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build sequence labeling network...
use_char:  True
char feature extractor:  CNN
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
build CRF...
Epoch: 0/1
 Learning rate is set as: 0.015

I have tried to make my data as similar as possible to the data in sample_data. Here is an example, using a BIO scheme.

udviser B-ORG
empati I-ORG
over I-ORG
for I-ORG
patienter I-ORG
og I-ORG
pårørende I-ORG
. I-ORG
Vagtforpligtigelse O
overlægen O
indgår O

So I am not sure why my data is not being recognized at all. Any ideas?

jiesutd commented 4 years ago

Train instance number: 0

This means the model can't load any instances from the input file. There must be a formatting error in your input data.

Rajashan commented 4 years ago

Any idea what that could be, given the format above? I have tried plain-text files, naming them .bmes, and both CRLF and LF line endings.

Rajashan commented 4 years ago

They are utf-8 encoded.

jiesutd commented 4 years ago

Check the demo data in this code.

Notice that there is a blank line between sentences to separate them.
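
One way to check this before training is to count the blank-line-separated sentences in a file. The snippet below is only a rough sketch and not NCRF++'s own loader: `count_sentences` and its `max_len` parameter are hypothetical names, and the default of 250 just mirrors the "MAX SENTENCE LENGTH: 250" line in the data summary. Whether the toolkit actually drops longer sentences is an assumption here.

```python
from typing import Iterable, Tuple


def count_sentences(lines: Iterable[str], max_len: int = 250) -> Tuple[int, int]:
    """Return (number of sentences, number of sentences longer than max_len).

    A sentence is a run of non-blank lines ended by a blank line or by the
    end of the input. This is a standalone check, not the toolkit's reader.
    """
    total = 0
    too_long = 0
    tokens = 0  # tokens seen in the sentence currently being read
    for line in lines:
        if line.strip():
            tokens += 1
            continue
        if tokens:  # a blank line closes the current sentence
            total += 1
            too_long += tokens > max_len
            tokens = 0
    if tokens:  # last sentence with no blank line after it
        total += 1
        too_long += tokens > max_len
    return total, too_long


# Two sentences separated by one blank line, in the "word label" format above.
sample = [
    "udviser B-ORG\n",
    "empati I-ORG\n",
    "\n",
    "Vagtforpligtigelse O\n",
    "overlægen O\n",
]
print(count_sentences(sample))  # (2, 0)
```

To run it on a real file, pass an open file handle, for example `count_sentences(open("sample_data/train_wout.bmes", encoding="utf-8"))`. A file with no blank lines at all comes back as a single, very long sentence, which would be consistent with the "Train instance number: 0" in the summary if over-length sentences are skipped.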