microsoft / CodeXGLUE

fix: Solving the problem of fine-tuning Bert and DistilBert #167

Closed · edwardqin-creator closed this 1 year ago

edwardqin-creator commented 1 year ago

This pull request enables fine-tuning Bert and DistilBert on the Clone-detection-POJ-104 task and fixes the IndexError: tuple index out of range that previously occurred.

Key Improvements and Changes Include:

  1. Changed the model config in run.py: For this task we should use BertModel and DistilBertModel instead of BertForMaskedLM and DistilBertForMaskedLM (see the sketch after this list). The task does not involve filling in masked tokens in text, which is what the Masked Language Modeling heads (BertForMaskedLM and DistilBertForMaskedLM) are designed for. Instead, it asks for the Top-K codes with the same semantics as the input code, so the model must capture the semantic relationship between different pieces of code. The base BertModel and DistilBertModel expose the encoder's contextual (and, for Bert, pooled) representations directly, which is what we need for measuring semantic similarity between code snippets, so they are better suited to this task.

  2. Resolved the issue of accessing output elements out of bounds: When fine-tuning DistilBert, the original model.py fails. Specifically, it tries to access the second element of the model output with the following line in model.py:

    outputs = self.encoder(input_ids, attention_mask=input_ids.ne(1))[1]

    This works for the Bert and CodeBert models, whose outputs include a pooled_output. However, since DistilBERT is not pre-trained on the Next Sentence Prediction task, its output has no pooled_output; it only contains the hidden states and attention distributions. To make the code compatible with DistilBert, we first collect all of the encoder outputs. If the output has more than one element (Bert or CodeBert), we use the second element (pooled_output). Otherwise (DistilBert), we take the first token ([CLS]) of the first output (sequence_output) as the representation of the entire sequence. Because the Transformer's self-attention lets the [CLS] vector aggregate information from the whole sequence, this is a common practice, even though it is not equivalent to BERT's pooled_output.

    # Get the full output tuple instead of indexing [1] directly
    outputs = self.encoder(input_ids, attention_mask=input_ids.ne(1))
    if len(outputs) > 1:
        # Bert / CodeBert: use the pooled_output
        outputs = outputs[1]
    else:
        # DistilBert: use the [CLS] token of the sequence_output
        outputs = outputs[0][:, 0, :]
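
For reference, a minimal sketch of the change described in point 1, assuming run.py selects model classes through a MODEL_CLASSES mapping as in other CodeXGLUE tasks (the exact keys and surrounding code in run.py may differ):

    from transformers import (BertConfig, BertModel, BertTokenizer,
                              DistilBertConfig, DistilBertModel, DistilBertTokenizer,
                              RobertaConfig, RobertaModel, RobertaTokenizer)

    MODEL_CLASSES = {
        # RoBERTa / CodeBERT entry stays unchanged
        'roberta': (RobertaConfig, RobertaModel, RobertaTokenizer),
        # before: (BertConfig, BertForMaskedLM, BertTokenizer)
        'bert': (BertConfig, BertModel, BertTokenizer),
        # before: (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer)
        'distilbert': (DistilBertConfig, DistilBertModel, DistilBertTokenizer),
    }

With the base models, self.encoder(...) returns the encoder outputs directly (sequence_output, plus pooled_output for Bert/CodeBert), which is what the model.py change above relies on.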

log: This pull request aims to solve the problem of fine-tuning Bert and DistilBert.

edwardqin-creator commented 1 year ago

@microsoft-github-policy-service agree [company="Huazhong University of Science and Technology"]

edwardqin-creator commented 1 year ago

@microsoft-github-policy-service agree company="Huazhong University of Science and Technology"