jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

Have you tried setting data.HP_batch_size to a value greater than 1? #22

Closed · Robets2020 closed this issue 6 years ago

Robets2020 commented 6 years ago

Have you tried setting data.HP_batch_size to a value greater than 1? I set data.HP_batch_size to 100, but the training results are not good: on the MSRA dataset, F1 stays around 0.74. I then switched to the Adam optimizer; the result is still not as good as the default setting, with F1 around 0.91-0.92.

jiesutd commented 6 years ago

I didn't try a bigger batch_size. Actually, a bigger batch size runs at the same speed as batch_size=1 due to implementation issues. So I suggest you just keep batch_size=1; otherwise you need to tune the hyperparameters for the different batch size.

BTW, the default setting gives more than 93% F1 on the MSRA NER dataset, and some people have achieved slightly higher performance than in our paper. You may want to check your data and settings if your result is only 91-92%.

Robets2020 commented 6 years ago

Actually, on the MSRA NER dataset with the default setting, my F1 result is 93.40% (acc: 0.9900, p: 0.9420, r: 0.9260, f: 0.9340), but the training speed is too slow. If there were a batch version of LatticeLSTM, the loss of the batch version (assume the batch size is set to n) should be equivalent to the loss of the original version with data.HP_batch_size=n, is that right? When data.HP_batch_size is set to n (n=100) with the default SGD in the code, the result is 0.74, so an equivalent batch version would run into the same problem. Using the same optimizer (Adam) and hyperparameters as ChineseNER (char embedding + CWS boundary feature, https://github.com/zjy-ucas/ChineseNER), the F1 of LatticeLSTM is about 0.91-0.92. The LatticeLSTM model overfits quickly.

jiesutd commented 6 years ago
  1. Glad to hear that you got a good result on the MSRA dataset.

  2. No, there are different ways to implement the loss for a larger batch size. In my code, the loss is accumulated within the batch and the parameters are updated once per batch. In this setting, a large batch size makes the accumulated loss very large, which makes training unstable, so it is no surprise that batch_size=100 leads to low performance. If you really want to set batch_size=100, you can at least try to: 1) tune a smaller learning rate; 2) average the loss over the batch (loss = loss/batch_size) — see the sketch after this list. That's why I said you need to tune the hyperparameters for a larger batch size.

  3. Please notice that my implementation already uses learning rate decay, which is not compatible with Adam. If you want to use Adam, remember to turn off the lr decay: https://github.com/jiesutd/LatticeLSTM/blob/7c1bf5be8828a097697ab4b4fade8cdb21a8a388/main.py#L253 In addition, you cannot carry the hyperparameters from one model over to a different one, as the structures are different; minor hyperparameter fine-tuning is necessary. BTW, our previous research shows that Adam is not always better than SGD; sometimes it gives lower performance than simple SGD. You can refer to our COLING 2018 paper here.
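For reference, here is a rough sketch of what I mean by averaging the loss over the batch and turning off the lr decay when Adam is used. It is a simplified loop, not the actual main.py code: `compute_batch_loss`, `batches`, and the `use_adam` flag are placeholders, while `lr_decay` and the `data.HP_*` hyperparameters are the ones already in the repo.

```python
import torch.optim as optim

def train_with_batches(model, data, batches, use_adam=False):
    """Sketch: batch-averaged loss, with lr decay applied only for SGD."""
    if use_adam:
        optimizer = optim.Adam(model.parameters(), lr=data.HP_lr)
    else:
        optimizer = optim.SGD(model.parameters(), lr=data.HP_lr)

    for idx in range(data.HP_iteration):
        if not use_adam:
            # Keep the lr decay for SGD only; Adam adapts its own step sizes.
            optimizer = lr_decay(optimizer, idx, data.HP_lr_decay, data.HP_lr)
        for batch in batches:  # each batch holds data.HP_batch_size sentences
            model.zero_grad()
            # Placeholder for the negative log-likelihood summed over the batch.
            loss = compute_batch_loss(model, batch)
            # Average so the gradient scale does not grow with the batch size.
            loss = loss / data.HP_batch_size
            loss.backward()
            optimizer.step()
```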

Robets2020 commented 6 years ago

I have tried tuning a smaller learning rate with SGD and the F1 result is still not very high, and I have also averaged the loss over the batch (loss = loss/batch_size). I commented out this line, optimizer = lr_decay(optimizer, idx, data.HP_lr_decay, data.HP_lr), when using Adam.
I also tried two other optimization methods, and to address the overfitting I adjusted some hyperparameters.

jiesutd commented 6 years ago

I think for a large batch size the final performance may be slightly behind batch_size=1, but it should absolutely give a much better result than 74%. There must be some incorrect settings for the larger batch size.

Robets2020 commented 6 years ago

You can verify this problem. I hope it gives a good result with a large batch size. If the batch size can only be set to 1 or a small value to get a good result, the practical industrial value of this method is limited when training with more data.

jiesutd commented 6 years ago

Yes, I will try to check whether this problem exists when I have time (not in the near future).

At least the model can be trained with batch_size=1 and then use a large batch size for decoding, which may be useful in industry.