oathaha opened 1 year ago
We pre-train CodeReviewer the 2nd way, training on all 4 tasks at the same time. The pre-training code is not released, but it is simple. The main code for processing the pre-training data is in `utils.py:TextDataset`. You can write code like the following to pre-train the model, where `examples` are TextDataset objects:
```python
source_ids = torch.tensor(
    [ex.source_ids for ex in examples], dtype=torch.long
).to(local_rank)
source_labels = torch.tensor(
    [ex.source_labels for ex in examples], dtype=torch.long
).to(local_rank)
target_ids = torch.tensor(
    [ex.target_ids for ex in examples], dtype=torch.long
).to(local_rank)
source_mask = source_ids.ne(tokenizer.pad_id)
target_mask = target_ids.ne(tokenizer.pad_id)
loss = model(
    input_ids=source_ids,
    input_labels=source_labels,
    decoder_input_ids=target_ids,
    attention_mask=source_mask,
    decoder_attention_mask=target_mask,
)
```
You can reopen the issue if needed.
Thank you for your reply.
Based on your answer, am I right that you just pack input/label pairs from the different tasks together to pre-train the model?
Basically, yes.
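A minimal sketch of that packing idea, for anyone reading along. This is not the released code: the `Example` class, `pack_tasks` helper, and task names are all illustrative (the names loosely follow the paper's four objectives), and the token ids are dummies. The point is just that examples from all four tasks go into one shuffled pool, so each batch can mix tasks.

```python
import random

# Hypothetical stand-in for utils.py:TextDataset items; field names
# mirror the snippet above (source_ids, source_labels, target_ids).
class Example:
    def __init__(self, task, source_ids, source_labels, target_ids):
        self.task = task
        self.source_ids = source_ids
        self.source_labels = source_labels
        self.target_ids = target_ids

def pack_tasks(datasets, batch_size, seed=0):
    """Merge per-task example lists into one shuffled stream of batches,
    so a single batch may contain examples from several tasks."""
    pool = [ex for examples in datasets.values() for ex in examples]
    random.Random(seed).shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]

# Toy data: four tasks, four examples each (all ids are dummies).
datasets = {
    task: [Example(task, [1, 2], [0, 0], [3, 4]) for _ in range(4)]
    for task in ("diff_tag", "denoise_code", "denoise_comment", "comment_gen")
}
batches = pack_tasks(datasets, batch_size=4)
```

Each batch from `pack_tasks` could then be fed through the tensor-building snippet above to compute the loss.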
I see. Thanks again.
In the CodeReviewer paper, I saw that there are 4 pre-training tasks. Can you explain in more detail how each task is pre-trained?
I'm not sure whether you pre-trained CodeReviewer using one of the methods below.
It would also be great if you could share the code for pre-training CodeReviewer.
Thanks.