smilelight / lightNLP

A deep learning framework for natural language processing, based on PyTorch and torchtext.
Apache License 2.0
823 stars · 212 forks

Is token_vec_300.bin in w2v or fastText format? #15

Closed marcusau closed 3 years ago

marcusau commented 3 years ago

Thanks for your amazing library.

One simple question about your NER module:

vec_path = '/home/lightsmile/NLP/embedding/char/token_vec_300.bin'

Is the vec bin a w2v or fastText file?

I already have my own pre-trained w2v bin file, which is only about 200 MB.

A fastText bin file is usually around 1 GB, which is not very suitable for a production environment.

Thanks a lot.

smilelight commented 3 years ago

The vec bin is in w2v format; its first line contains the number of words and the dimension of the vectors. In this example, token_vec_300.bin is a word-vector file for Chinese, and its file size is about 67 MB. In this library I use torchtext to load the word-embedding file, and it supports both formats. This project is just for practice, so if you want to use lightNLP in your own project, you may need to fix some bugs and modify the source code as you go.
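For reference, a minimal sketch of inspecting and loading such a file (this assumes a plain-text w2v file as described above and a legacy torchtext version that provides torchtext.vocab.Vectors; the path is the example from the question):

from torchtext.vocab import Vectors

vec_path = '/home/lightsmile/NLP/embedding/char/token_vec_300.bin'

# In the w2v text format the first line is "<num_words> <dim>",
# followed by one "<token> <v1> ... <v300>" line per token.
with open(vec_path, encoding='utf-8') as f:
    num_words, dim = f.readline().split()
    print(f'{num_words} tokens, {dim}-dimensional vectors')

# torchtext can load the same file directly; the header line is skipped
# because it does not look like a valid vector row.
vectors = Vectors(name=vec_path)
print(vectors.dim, len(vectors.itos))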

marcusau commented 3 years ago

Thanks a lot.

It works now.

One more question: how do I increase the batch size per epoch in the statement

ner_model.train(train_path, vectors_path=vec_path, dev_path=dev_path, save_path=model_path, log_dir=log_dir,)

This is because I would like to run the training on Google Colab.

The default batch size is 128; I think Colab can handle 218 or above.

Thanks a lot

smilelight commented 3 years ago

The source code related to batch_size in lightNLP/module.py at master · smilelight/lightNLP is:

train_iter = ner_tool.get_iterator(train_dataset, batch_size=DEFAULT_CONFIG['batch_size'])
config = Config(word_vocab, tag_vocab, save_path=save_path, vector_path=vectors_path, **kwargs)
bilstmcrf = BiLstmCrf(config)

There's a bug in the train_iter = ... line: get_iterator uses DEFAULT_CONFIG['batch_size'] rather than the value passed to train(), so for now you need to set the batch_size in two places:

  1. Set batch_size in the train statement:
    ner_model.train(train_path, vectors_path=vec_path, dev_path=dev_path, save_path=model_path, log_dir=log_dir, batch_size=218)
  2. Set batch_size in the train_iter = ... line in module.py:
    train_iter = ner_tool.get_iterator(train_dataset, batch_size=218)

I'll fix the bug soon.
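For reference, the eventual fix would probably be a one-line change along these lines (a sketch only, not the committed fix; it assumes the Config object keeps the batch_size passed through **kwargs as config.batch_size, and that the Config is created before the iterator):

config = Config(word_vocab, tag_vocab, save_path=save_path, vector_path=vectors_path, **kwargs)
# use the configured batch size instead of DEFAULT_CONFIG, so that
# batch_size=218 passed to ner_model.train(...) actually takes effect
train_iter = ner_tool.get_iterator(train_dataset, batch_size=config.batch_size)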