microsoft / CodeBERT


How do you pre-train CodeReviewer? #224

Open oathaha opened 1 year ago

oathaha commented 1 year ago

In the CodeReviewer paper, I saw that there are 4 pre-training tasks. Can you explain in more detail how each task is pre-trained?

I'm not sure which of the following methods you used to pre-train CodeReviewer:

  1. Pre-train each task separately (i.e., pre-train the model in 4 separate rounds).
  2. Pre-train all tasks at the same time (e.g., multi-task learning or something similar).

It would also be great if you could share the code for pre-training CodeReviewer.

Thanks.

Lizhmq commented 1 year ago

We pre-train CodeReviewer the 2nd way, training on all 4 tasks at the same time. The pre-training code is not released, but it is simple. The main code for processing the pre-training data is in utils.py:TextDataset. You can write code like the following to pre-train the model, where examples is a list of examples from the TextDataset:

    # Batch the tokenized examples produced by utils.py:TextDataset
    source_ids = torch.tensor(
        [ex.source_ids for ex in examples], dtype=torch.long
    ).to(local_rank)
    source_labels = torch.tensor(
        [ex.source_labels for ex in examples], dtype=torch.long
    ).to(local_rank)
    target_ids = torch.tensor(
        [ex.target_ids for ex in examples], dtype=torch.long
    ).to(local_rank)
    # Attention masks mark the non-padding positions
    source_mask = source_ids.ne(tokenizer.pad_id)
    target_mask = target_ids.ne(tokenizer.pad_id)

    # Forward pass returns the training loss for this batch
    loss = model(
        input_ids=source_ids,
        input_labels=source_labels,
        decoder_input_ids=target_ids,
        attention_mask=source_mask,
        decoder_attention_mask=target_mask,
    )
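
In case it helps, here is a minimal sketch of how that snippet could sit inside a pre-training loop. This is not the released code: the compute_loss helper, the optimizer choice, the learning rate, and the dataloader are assumptions layered on top of the snippet above, and model, tokenizer, and local_rank are assumed to exist as in that snippet.

    import torch
    from torch.optim import AdamW

    def compute_loss(model, examples, tokenizer, local_rank):
        # Same batching as the snippet above, factored into a helper
        # (the helper itself is an assumption, not part of the repo).
        source_ids = torch.tensor(
            [ex.source_ids for ex in examples], dtype=torch.long
        ).to(local_rank)
        source_labels = torch.tensor(
            [ex.source_labels for ex in examples], dtype=torch.long
        ).to(local_rank)
        target_ids = torch.tensor(
            [ex.target_ids for ex in examples], dtype=torch.long
        ).to(local_rank)
        return model(
            input_ids=source_ids,
            input_labels=source_labels,
            decoder_input_ids=target_ids,
            attention_mask=source_ids.ne(tokenizer.pad_id),
            decoder_attention_mask=target_ids.ne(tokenizer.pad_id),
        )

    # Assumed pre-training loop: dataloader yields lists of TextDataset examples.
    optimizer = AdamW(model.parameters(), lr=5e-5)
    model.train()
    for examples in dataloader:
        loss = compute_loss(model, examples, tokenizer, local_rank)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Distributed training, gradient accumulation, and learning-rate scheduling are left out for brevity.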

You can reopen the issue if needed.

oathaha commented 1 year ago

Thank you for your reply.

Based on your answer, am I right that you just pack the input/label pairs of the different tasks together to pre-train the model?

Lizhmq commented 1 year ago

Basically, yes.
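
For anyone else reading this thread, a minimal sketch of that "packing" idea, assuming each pre-training objective can be materialized as its own TextDataset. The task names, the TextDataset constructor arguments, and the args fields below are illustrative assumptions, not the exact interface in this repo.

    from torch.utils.data import ConcatDataset, DataLoader
    from utils import TextDataset  # utils.py as referenced above

    # Hypothetical: one dataset per pre-training objective.
    task_datasets = [
        TextDataset(tokenizer, args, task=task)
        for task in (
            "diff_tag_prediction",
            "denoising_code_diff",
            "denoising_review_comment",
            "review_comment_generation",
        )
    ]

    # Concatenate and shuffle so every batch mixes examples from all tasks;
    # the identity collate_fn keeps each batch as a plain list of example
    # objects, matching the `examples` variable in the snippet above.
    mixed_dataset = ConcatDataset(task_datasets)
    dataloader = DataLoader(
        mixed_dataset,
        batch_size=args.train_batch_size,
        shuffle=True,
        collate_fn=lambda batch: batch,
    )

The point is just that the loss computation above does not need to know which task an example came from; each example's source_ids, source_labels, and target_ids carry everything needed.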

oathaha commented 1 year ago

I see. Thanks again.