marcusau closed this issue 3 years ago.
The vec bin is in w2v format; its first line contains the number of words and the dimension of the vectors. In this example, token_vec_300.bin is a Chinese word-vector file of about 67 MB. This library uses torchtext to load the word-embedding file, and torchtext supports both formats. This project is just for practice, so if you want to use lightNLP in your own project, you may need to fix some bugs and modify the source code as you go.
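For reference, here is a minimal sketch (my own illustration, not part of lightNLP) of loading a file in this format with gensim; the word count in the comment is made up:

from gensim.models import KeyedVectors

# A w2v-format file starts with a header line "<num_words> <dim>",
# e.g. "16115 300", followed by one "<token> <dim floats>" line per word.
# Pass binary=True instead if the file uses word2vec's binary layout.
vectors = KeyedVectors.load_word2vec_format('token_vec_300.bin', binary=False)
print(vectors.vector_size)  # -> 300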
Thanks a lot.
It works now.
One more question: how do I increase the batch size used for training in this statement?
ner_model.train(train_path, vectors_path=vec_path, dev_path=dev_path, save_path=model_path, log_dir=log_dir,)
I would like to run the training on Google Colab.
The default batch size is 128; I think Colab can handle 218 or above.
Thanks a lot.
The source code relating to batch_size in lightNLP/module.py (master branch of smilelight/lightNLP) is:
train_iter = ner_tool.get_iterator(train_dataset, batch_size=DEFAULT_CONFIG['batch_size'])
config = Config(word_vocab, tag_vocab, save_path=save_path, vector_path=vectors_path, **kwargs)
bilstmcrf = BiLstmCrf(config)
There's a bug in the train_iter = ... line: it always takes the batch size from DEFAULT_CONFIG and ignores the keyword arguments, so for now you need to set the batch size in two places. First, pass it in the train call:
ner_model.train(train_path, vectors_path=vec_path, dev_path=dev_path, save_path=model_path, log_dir=log_dir, batch_size=218)
Second, change the train_iter line in module.py to:
train_iter = ner_tool.get_iterator(train_dataset, batch_size=218)
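For reference, a more general fix would be to read the value from kwargs and fall back to the default; this is just a sketch of the idea, assuming kwargs is in scope at that point in train(), as the snippet above suggests:

# Hypothetical patch for lightNLP/module.py: respect a user-supplied
# batch_size and fall back to DEFAULT_CONFIG only when none is given.
batch_size = kwargs.get('batch_size', DEFAULT_CONFIG['batch_size'])
train_iter = ner_tool.get_iterator(train_dataset, batch_size=batch_size)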
I'll fix the bug soon.
Thanks for your amazing library.
One simple question about your NER module:
vec_path = '/home/lightsmile/NLP/embedding/char/token_vec_300.bin'
Is the vec bin a w2v file or a fastText file?
I already have my own pre-trained w2v bin file, since its size is only about 200 MB.
A fastText bin file is usually about 1 GB, which is not very practical for a production environment.
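One quick way to tell the two formats apart (just a sketch): a w2v-format file opens with a plain-text "<num_words> <dim>" header, while a fastText bin starts with binary data, so peeking at the first line is usually enough. The example output below is made up:

# Peek at the header: a w2v-format file begins with two ASCII numbers.
with open('token_vec_300.bin', 'rb') as f:
    first_line = f.readline().split()
print(first_line)  # e.g. [b'16115', b'300'] for a w2v-format file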
Thanks a lot.