Since model quantization is used, the range of values that the PLM's weights can represent is not as large as in fp32. If we backpropagate normally, the model may suffer from underflow: the gradient becomes too small to be represented, goes to zero, and stays zero all the way back to the input template. Scaling the loss, so that the gradients are scaled too, addresses this issue. Note that the learning rate of the input template is also scaled down by a factor of 1024 to compensate.
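A minimal sketch of this kind of static loss scaling in PyTorch; the parameter name `prompt_embedding`, the optimizer, and the dummy loss are illustrative, not the repository's actual code:

```python
import torch

LOSS_SCALE = 1024  # static scale factor, matching the 1024 used in 5.1_BMInf_CPM.py

# `prompt_embedding` stands in for the trainable input template (illustrative name).
prompt_embedding = torch.nn.Parameter(torch.randn(16, 768))
optimizer = torch.optim.SGD(
    [prompt_embedding],
    lr=1e-2 / LOSS_SCALE,  # learning rate divided by the same factor to compensate
)

# Dummy loss standing in for the PLM forward pass.
loss = (prompt_embedding ** 2).mean()

optimizer.zero_grad()
# Scale the loss so the gradients are scaled too; small gradients that would
# otherwise underflow in the quantized model survive back to the template.
(loss * LOSS_SCALE).backward()
optimizer.step()
```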
Thanks for your reply. Is this dataset (ChnSentiCorp) not integrated into this repository, or should I download and process the original dataset myself?
========================================================
Another question: I'm reading this code and have a few questions for you:
Encoder input: token_01, token_02, token_03, ..., token_mask, token_pad, token_pad
Decoder input: token_start, token_mask, token_pad, token_pad, ..., token_pad
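For illustration, here is a rough sketch of how such sequences might be assembled; the IDs `MASK`, `PAD`, `START` and the helper `build_inputs` are placeholders, not the repository's actual constants:

```python
MASK, PAD, START = 101, 0, 102  # placeholder special-token IDs, illustration only
MAX_LEN = 8

def build_inputs(token_ids):
    """Encoder input: the text tokens followed by a mask token, padded to MAX_LEN.
    Decoder input: the start token, the mask token, then padding."""
    encoder_input = token_ids + [MASK]
    encoder_input += [PAD] * (MAX_LEN - len(encoder_input))
    decoder_input = [START, MASK] + [PAD] * (MAX_LEN - 2)
    return encoder_input, decoder_input

# 11, 12, 13 stand in for token_01, token_02, token_03.
enc, dec = build_inputs([11, 12, 13])
print(enc)  # [11, 12, 13, 101, 0, 0, 0, 0]  -> tokens, mask, padding
print(dec)  # [102, 101, 0, 0, 0, 0, 0, 0]   -> start, mask, padding
```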
Thanks for your reply. Is this dataset (ChnSentiCorp) not integrated into this repository, or should I download and process the original dataset myself?
You can download it yourself.
When CPM-2 is used for prompt tuning, are the input and output of the PLM in the following format?
Yes.
And the maximum length of the label-mapping words is 5?
See also #79. We avoid label words that are longer than one character in Chinese; we prefer label words that already exist in the vocabulary.
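A minimal illustration of that selection rule, assuming a hypothetical `vocab` mapping and candidate list (not the repository's actual vocabulary or verbalizer):

```python
# Hypothetical vocabulary mapping tokens to IDs; illustrative only.
vocab = {"好": 1, "差": 2, "还行": 3}

candidates = ["好", "差", "还行", "优秀"]

# Keep only label words that are a single Chinese character
# and already exist in the vocabulary, as suggested above.
label_words = [w for w in candidates if len(w) == 1 and w in vocab]
print(label_words)  # ['好', '差']
```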
Why is the loss multiplied by 1024 in the file 5.1_BMInf_CPM.py?