liuwei1206 / LEBERT

Code for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling Using BERT Adapter"

Question about some hyperparameter settings in the NER experiments #17

Closed hezongfeng closed 3 years ago

hezongfeng commented 3 years ago

--max_scan_num=1000000 --per_gpu_train_batch_size=4 --per_gpu_eval_batch_size=16

1) Were the NER results reported in the paper obtained with the parameters above?
2) For max_scan_num, can it be understood as scanning only the first 1,000,000 words of the pretrained word embeddings? In other words, does the whole experiment only use the first 1,000,000 word vectors from https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz ?

liuwei1206 commented 3 years ago

Hi,

A1: The training batch_size for each dataset is exactly the value described in the paper.

A2: A very good question! Yes, --max_scan_num=1000000 means we only use the first 1,000,000 words in the embedding. This is a hyperparameter, and we set a different value for each dataset. Usually, for small datasets we search over {1.5M, 2M, 3M}; for large ones we search over {2M, 3M, 5M}. Intuitively, there is a trade-off in max_scan_num: a larger value lets the corpus match more gold segmentation words, but as max_scan_num grows, the number of incorrectly matched words (noise words) also increases, making it harder for the Lexicon Adapter to pick out the correct words.
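For intuition, here is a minimal sketch (not the repository's actual loading code) of what --max_scan_num means in practice, assuming the Tencent embedding file is in plain-text word2vec format (an optional header line `<vocab_size> <dim>` followed by one `word v1 ... v_dim` line per word): only the first max_scan_num entries are ever read, so only those words can later be matched against the corpus.

```python
# Minimal sketch, not the repository's loader: read at most max_scan_num
# word vectors from a text-format embedding file (Tencent-style word2vec text).
import numpy as np

def load_first_n_vectors(embedding_path, max_scan_num=1000000):
    word2vec = {}
    with open(embedding_path, "r", encoding="utf-8") as f:
        first = f.readline().split()
        if len(first) == 2:          # header line: "<vocab_size> <dim>"
            dim = int(first[1])
        else:                        # no header: the first line is already a vector
            dim = len(first) - 1
            word2vec[first[0]] = np.asarray(first[1:], dtype=np.float32)
        for line in f:
            if len(word2vec) >= max_scan_num:
                break                # stop scanning once the budget is reached
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue             # skip malformed lines
            word2vec[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word2vec
```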

Hope it helps.

Wei

liuwei1206 commented 3 years ago

And the value of max_scan_num for each dataset can be found in the checkpoints: they contain shell scripts that give the max_scan_num values used in my paper. Note that those values may not be the best, since I didn't tune that hyperparameter carefully.

hezongfeng commented 3 years ago

OK, I will give it a try. Thank you very much. The paper you wrote is really great!
