
[Code-to-Text]: pretrained model from Hugging Face has different keys than expected by model.py #125


poojitharamachandra commented 2 years ago

Hi, I downloaded the model from https://huggingface.co/microsoft/codebert-base/tree/main and am using it to run inference (without fine-tuning). But I am unable to load the model file pytorch_model.bin because there is a mismatch in the keys.

The Seq2Seq model expects keys in the format "encoder.encoder.layer.1.attention.output.dense.bias", but the saved model has keys in the format "encoder.layer.1.attention.output.dense.bias".
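For anyone hitting the same error: the mismatch comes from where the weights live. The Hugging Face checkpoint is a plain RoBERTa encoder, while the repo's Seq2Seq wrapper in model.py stores that encoder as its `encoder` attribute, which adds one more prefix to every key. A minimal sketch of how to see this (class and path names per the checkpoint and repo layout):

```python
from transformers import RobertaModel

# Keys in the downloaded checkpoint / in a plain RobertaModel:
#   "encoder.layer.1.attention.output.dense.bias", ...
codebert = RobertaModel.from_pretrained("microsoft/codebert-base")
print(list(codebert.state_dict())[:3])

# The repo's Seq2Seq (Code-Text/code-to-text/code/model.py) keeps this model
# as self.encoder, so its own state_dict uses
#   "encoder.encoder.layer.1.attention.output.dense.bias", ...
# Loading pytorch_model.bin straight into Seq2Seq therefore fails; the
# checkpoint should be loaded into a RobertaModel first, which is then passed
# to Seq2Seq as the encoder (see the next comment).
```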

celbree commented 2 years ago

You should load CodeBERT with an encoder-only model class. If you want to use CodeBERT to initialize the encoder in a Seq2Seq model, you can refer to our implementation here: https://github.com/microsoft/CodeXGLUE/blob/main/Code-Text/code-to-text/code/run.py#L261
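For reference, the linked section of run.py does roughly the following (a condensed sketch, not the exact code; `Seq2Seq` is the class from the repo's model.py, and the beam size / max length values are just example settings):

```python
import torch.nn as nn
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer
from model import Seq2Seq  # Code-Text/code-to-text/code/model.py

config = RobertaConfig.from_pretrained("microsoft/codebert-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# CodeBERT only initializes the encoder; the decoder is built fresh.
encoder = RobertaModel.from_pretrained("microsoft/codebert-base", config=config)
decoder_layer = nn.TransformerDecoderLayer(d_model=config.hidden_size,
                                           nhead=config.num_attention_heads)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

model = Seq2Seq(encoder=encoder, decoder=decoder, config=config,
                beam_size=10, max_length=128,
                sos_id=tokenizer.cls_token_id, eos_id=tokenizer.sep_token_id)
```

Because the pretrained weights enter through `RobertaModel.from_pretrained`, there is no need to load pytorch_model.bin into the Seq2Seq object yourself.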

poojitharamachandra commented 2 years ago

Could you please elaborate on how to use the pre-trained model to generate descriptions for my code? Is fine-tuning necessary?

celbree commented 2 years ago

Yes, it needs fine-tuning. CodeBERT is only an encoder model. If you want to generate descriptions, you need a Seq2Seq model that uses CodeBERT as the encoder to encode code and a decoder to generate comments. The decoder needs to be trained from scratch. Follow the instructions at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text to fine-tune CodeBERT for this purpose. And if you want better performance, you could try the newest SOTA model, UniXcoder.
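The actual commands and hyperparameters are in the linked README; schematically, the training step continues the sketch above like this (the `(source_ids, source_mask, target_ids, target_mask)` batch format and the Seq2Seq forward signature follow my reading of run.py, and `train_dataloader` is a hypothetical dataloader over the train.jsonl examples):

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

for source_ids, source_mask, target_ids, target_mask in train_dataloader:
    # The Seq2Seq forward returns the cross-entropy loss over the target
    # (docstring) tokens; this updates both the randomly initialized decoder
    # and the CodeBERT encoder.
    loss, _, _ = model(source_ids=source_ids, source_mask=source_mask,
                       target_ids=target_ids, target_mask=target_mask)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```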

poojitharamachandra commented 2 years ago

Thanks. I was able to fine-tune and run inference. I will also check UniXcoder. In the paper you say that 15% of the tokens are masked for MLM. Where do you do that in the code? During inference, does the fine-tuned model predict the whole sentence or just parts of it?

celbree commented 2 years ago

In the paper you say that 15% of the tokens are masked for MLM. Where do you do that in the code?

MLM is for model pre-training, while this repo only fine-tunes these models.
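For intuition only: the 15% masking belongs to the pre-training objective described in the CodeBERT paper, so you will not find it in this repo. The general mechanism looks like Hugging Face's MLM collator below (an illustration, not CodeBERT's actual pre-training code):

```python
from transformers import DataCollatorForLanguageModeling, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
# mlm_probability=0.15 randomly selects ~15% of tokens for masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

batch = collator([tokenizer("def add(a, b): return a + b")])
print(batch["input_ids"])  # a few positions replaced by <mask> (or random tokens)
print(batch["labels"])     # -100 everywhere except the selected positions
```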

During inference, does the fine-tuned model predict the whole sentence or just parts of it?

The whole sentence.
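That is, the Seq2Seq model beam-searches a complete summary and you decode it as one string. Continuing the earlier sketches (here `source_ids`/`source_mask` stand for one tokenized code example, and the call pattern follows my reading of run.py's test loop):

```python
import torch

model.eval()
with torch.no_grad():
    # With no target_ids, the Seq2Seq wrapper runs beam search and returns
    # predicted token ids for the whole summary.
    preds = model(source_ids=source_ids, source_mask=source_mask)

for pred in preds:
    best = pred[0].cpu().tolist()  # top beam for this example
    text = tokenizer.decode(best, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False)
    print(text)  # a complete natural-language description of the code
```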

poojitharamachandra commented 2 years ago

Thanks. Could you direct me to the repo with the model architecture?

celbree commented 2 years ago

If you mean the pre-training code of CodeBERT, I think it isn't released. If you need the detailed model architecture, please refer to Hugging Face's transformers repo: https://github.com/huggingface/transformers. CodeBERT shares the same architecture as RoBERTa, which you can find there.
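Concretely, you can inspect the architecture by loading the checkpoint with the RoBERTa classes from transformers:

```python
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

print(config)  # RoBERTa-base-sized: hidden_size=768, 12 layers, 12 attention heads
print(model)   # full module tree: embeddings, 12 encoder layers, pooler
```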

poojitharamachandra commented 2 years ago

Thanks. Have you released a fine-tuned UniXcoder model for the code summarization task that was fine-tuned on C projects?