Since model quantization is used, the range of values that the PLM's weights can represent is not as large as in fp32. If we backpropagate normally, the model may suffer from underflow: the gradient becomes too small to be represented, goes to zero, and stays zero all the way back to the input template. Scaling the loss, so that the gradients are scaled too, addresses this issue. Note that the learning rate of the input template is also scaled down by a factor of 1024 to compensate.
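A minimal sketch of this kind of static loss scaling in PyTorch; the parameter name `prompt_embedding`, the optimizer, and the dummy loss are illustrative, not the repository's actual code:

```python
import torch

LOSS_SCALE = 1024  # static scale factor, matching the 1024 used in 5.1_BMInf_CPM.py

# `prompt_embedding` stands in for the trainable input template (illustrative name).
prompt_embedding = torch.nn.Parameter(torch.randn(16, 768))
optimizer = torch.optim.SGD(
    [prompt_embedding],
    lr=1e-2 / LOSS_SCALE,  # learning rate divided by the same factor to compensate
)

# Dummy loss standing in for the PLM forward pass.
loss = (prompt_embedding ** 2).mean()

optimizer.zero_grad()
# Scale the loss so the gradients are scaled too; small gradients that would
# otherwise underflow in the quantized model survive back to the template.
(loss * LOSS_SCALE).backward()
optimizer.step()
```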
Thanks for your reply. Is this dataset (ChnSentiCorp) not integrated into this repository, or should I download and process the original dataset myself?
========================================================
Another question: I'm reading this code and have a few questions for you:
Encoder input: token_01, token_02, token_03, ..., token_mask, token_pad, token_pad
Decoder input: token_start, token_mask, token_pad, token_pad, ..., token_pad
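For illustration, here is a rough sketch of how such sequences might be assembled; the IDs `MASK`, `PAD`, `START` and the helper `build_inputs` are placeholders, not the repository's actual constants:

```python
MASK, PAD, START = 101, 0, 102  # placeholder special-token IDs, illustration only
MAX_LEN = 8

def build_inputs(token_ids):
    """Encoder input: the text tokens followed by a mask token, padded to MAX_LEN.
    Decoder input: the start token, the mask token, then padding."""
    encoder_input = token_ids + [MASK]
    encoder_input += [PAD] * (MAX_LEN - len(encoder_input))
    decoder_input = [START, MASK] + [PAD] * (MAX_LEN - 2)
    return encoder_input, decoder_input

# 11, 12, 13 stand in for token_01, token_02, token_03.
enc, dec = build_inputs([11, 12, 13])
print(enc)  # [11, 12, 13, 101, 0, 0, 0, 0]  -> tokens, mask, padding
print(dec)  # [102, 101, 0, 0, 0, 0, 0, 0]   -> start, mask, padding
```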
Thanks for your reply. Is this dataset (ChnSentiCorp) not integrated into this repository, or should I download and process the original dataset myself?
You can download it yourself.
When CPM-2 is used for prompt tuning, are the input and output of the PLM in the following format?
Yes.
And the maximum length of the label-mapping words is 5?
See also #79. We avoid label words that are longer than one character in Chinese; we prefer label words that already exist in the vocabulary.
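A minimal illustration of that selection rule, assuming a hypothetical `vocab` mapping and candidate list (not the repository's actual vocabulary or verbalizer):

```python
# Hypothetical vocabulary mapping tokens to IDs; illustrative only.
vocab = {"好": 1, "差": 2, "还行": 3}

candidates = ["好", "差", "还行", "优秀"]

# Keep only label words that are a single Chinese character
# and already exist in the vocabulary, as suggested above.
label_words = [w for w in candidates if len(w) == 1 and w in vocab]
print(label_words)  # ['好', '差']
```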
Why is the loss multiplied by 1024 in the file 5.1_BMInf_CPM.py?