jiesutd / NCRFpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS tagging, segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Apache License 2.0

Reading fasttext word embeddings #177

Closed: jd-coderepos closed this issue 3 years ago

jd-coderepos commented 3 years ago

A user reported success using a fastText model (ref: https://github.com/jiesutd/NCRFpp/issues/80). However, I can't seem to get it to work. Part of my training summary stats is shown below.

Seed num: 42
MODEL: train
Load pretrained word embedding, norm: False, dir: ../drive/MyDrive/fasttext/wiki.en.vec
Embedding:
     pretrain word:1, prefect match:0, case_match:0, oov:8105, oov%:0.9998766345916605
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BMES
     Split         token:  ||| 
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Word  alphabet size: 8106
     Char  alphabet size: 76
     Label alphabet size: 6
     Word embedding  dir: ../drive/MyDrive/fasttext/wiki.en.vec
     Char embedding  dir: None
     Word embedding size: 1
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: ../drive/MyDrive/labeling/train.data
     Dev    file directory: ../drive/MyDrive/labeling/dev.data
     Test   file directory: ../drive/MyDrive/labeling/test.data
     Raw    file directory: None
     Dset   file directory: None
     Model  file directory: ../drive/MyDrive/labeling/ccnn-wbilstm-crf-fasttext
     Loadmodel   directory: None
     Decode file directory: None
     Train instance number: 348
     Dev   instance number: 19
     Test  instance number: 59
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++    

I downloaded the English text-format model from https://fasttext.cc/docs/en/pretrained-vectors.html

Happy to hear any comments. Thank you!
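
For reference, a quick way to see what the embedding loader is reading from the .vec file (a minimal sketch, not part of NCRF++; the path is just my local copy):

```python
# Peek at the first two lines of the fastText .vec file to see what the
# NCRF++ embedding loader is actually reading. The official text-format
# files from fasttext.cc start with a "<vocab_size> <dim>" header line.
vec_path = "../drive/MyDrive/fasttext/wiki.en.vec"  # my local copy

with open(vec_path, "r", encoding="utf-8", errors="ignore") as f:
    first = f.readline().split()
    second = f.readline().split()

print("tokens on line 1:", len(first))   # a 2-token line means it is a header
print("tokens on line 2:", len(second))  # word + 300 values -> 301 tokens
```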

jiesutd commented 3 years ago

“Embedding: pretrain word:1, prefect match:0, case_match:0, oov:8105, oov%:0.9998766345916605”

That message shows the embeddings were not loaded correctly. Double-check the format of the embedding file: if its first line is not an embedding vector (in the word2vec/fastText text format, the first line is a header giving the vocabulary size and embedding dimension), you should delete that line.
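
If it helps, a one-off script along these lines can produce a header-free copy to point the word embedding dir at (a rough sketch, not part of NCRF++; the file names are only examples):

```python
# Copy a fastText .vec file while skipping the "<vocab_size> <dim>" header
# line, so that every line in the output is "word v1 v2 ... v300".
src = "wiki.en.vec"       # original fastText text embeddings (example path)
dst = "new-wiki.en.vec"   # header-free copy (example path)

with open(src, "r", encoding="utf-8", errors="ignore") as fin, \
     open(dst, "w", encoding="utf-8") as fout:
    fin.readline()        # discard the header line (vocab size and dimension)
    for line in fin:
        fout.write(line)
```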

jd-coderepos commented 3 years ago

Thank you for the tip, it works very well now :)

Seed num: 42
MODEL: train
Load pretrained word embedding, norm: False, dir: ../drive/MyDrive/fasttext/new-wiki.en.vec
Embedding:
     pretrain word:2518768, prefect match:4704, case_match:952, oov:2449, oov%:0.30212188502343945
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BMES
     Split         token:  ||| 
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Word  alphabet size: 8106
     Char  alphabet size: 76
     Label alphabet size: 6
     Word embedding  dir: ../drive/MyDrive/fasttext/new-wiki.en.vec
     Char embedding  dir: None
     Word embedding size: 300
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: ../drive/MyDrive/train.data
     Dev    file directory: ../drive/MyDrive/dev.data
     Test   file directory: ../drive/MyDrive/test.data
     Raw    file directory: None
     Dset   file directory: None
     Model  file directory: ../drive/MyDrive/ccnn-wbilstm-crf-fasttext-lr001
     Loadmodel   directory: None
     Decode file directory: None
     Train instance number: 348
     Dev   instance number: 19
     Test  instance number: 59
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model        use_crf: True
     Model word extractor: LSTM
     Model       use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 200
     BatchSize: 10
     Average  batch   loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.01
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.5
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: False
DATA SUMMARY END.