microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

The --not_predict_token still allows the token to be predicted #118

Closed aretius closed 4 years ago

aretius commented 4 years ago

Hey all, I am trying to use `--not_predict_token` to exclude a certain token, `[CLS]`, from predictions. The issue is that while decoding on news texts it produces outputs like "Stocks in the news: RIL [CLS] [CLS][CLS][CLS][CLS][CLS][CLS][CLS][CLS][CLS][CLS][CLS][CLS]" and so on. Of course, if I manually truncate the output at the first `[CLS]` token, it makes perfect sense. However, even though I pass `--not_predict_token`, I still see outputs like the above. Any tips on how to improve such cases?
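
For reference, this is roughly the manual truncation I am doing for now (a quick sketch; `truncate_at_cls` is just a name I made up for illustration):

```python
def truncate_at_cls(text: str) -> str:
    """Cut the decoded string at the first [CLS] occurrence."""
    idx = text.find("[CLS]")
    return text[:idx].rstrip() if idx != -1 else text

print(truncate_at_cls("Stocks in the news: RIL [CLS] [CLS][CLS]"))
# -> Stocks in the news: RIL
```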

donglixp commented 4 years ago
  1. We could check whether the token id is correctly converted at https://github.com/microsoft/unilm/blob/b3b78ee8710060dcada404acd015a79fab8343cb/unilm-v1/src/biunilm/decode_seq2seq.py#L172

  2. If the token id was correctly obtained in the previous step, its decoding score would be set to -10000 as in https://github.com/microsoft/unilm/blob/b3b78ee8710060dcada404acd015a79fab8343cb/unilm-v1/src/pytorch_pretrained_bert/modeling.py#L1539, so that the blocked token won't appear in the top-k candidates.
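
For illustration, here is a minimal sketch of the two steps above. It uses the `transformers` `BertTokenizer` as a stand-in for the repo's tokenizer; `log_scores`, `forbid_mask`, and the other names are illustrative, not the actual variables in `decode_seq2seq.py` / `modeling.py`:

```python
import torch
from transformers import BertTokenizer  # stand-in for the repo's tokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Step 1: convert the blocked tokens to vocabulary ids and sanity-check
# that they round-trip, i.e. the ids really correspond to [CLS].
not_predict_tokens = ["[CLS]"]
not_predict_ids = tokenizer.convert_tokens_to_ids(not_predict_tokens)
assert tokenizer.convert_ids_to_tokens(not_predict_ids) == not_predict_tokens

# Step 2: at each decoding step, push the blocked ids' scores to a large
# negative value so they cannot survive the top-k candidate selection.
log_scores = torch.randn(1, tokenizer.vocab_size)  # dummy per-step scores
forbid_mask = torch.zeros(log_scores.size(-1))
forbid_mask[not_predict_ids] = 1.0
log_scores = log_scores - 10000.0 * forbid_mask
```

If the assertion in step 1 passes but `[CLS]` still appears in the output, the masking in step 2 would be the place to check.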

aretius commented 4 years ago

@donglixp I will check whether the decoding id was correctly obtained. For the time being I will close this issue, and if something comes up I will reopen it.

Thanks!