Closed akhileshgotmare closed 3 years ago
Hi, thank you for your attention.
<EOL>
when pre-training. <s>
and </s>
are added at the beginning and the end of a source code file, respectively.<s> pl_string </s>
.\n
with <EOL>
.Thanks for the details @celbree For pre-training, are these hyper-parameters the same as the ones used in the fine-tuning scripts? number of epochs, learning rate, per_gpu_train_batch_size, weight decay, gradient acc steps, optimizer
The hyperprarameters are different from those in fine-tuning scripts.
Thanks @celbree! I'm assuming the above ones you shared are for the Java model, are these the same for the python model? Did you perform any kind of hyper-parameter tuning or were these chosen randomly?
Yes. The same for both datasets.
Hi @celbree I'm trying to replicate CodeGPT java (adapted, so starting from openai gpt-2 checkpoint) pre-training on CodeSearchNet and would like to clarify some aspects:
- Did you use
TextDataset
class from CodeXGLUE/Code-Code/CodeCompletion-token/code/dataset.py and therun_lm.py
script as described here without the--not_pretrain
argument for pre-training?- Codesearchnet has bimodal (~0.5M NL-PL pairs) and unimodal (~1.1M PL only examples) data. Did you use only PL from these two sources or did you use the NL-PL pairs too? From the
TextDataset
class it seems you've used all the 1.6M samples, but only PL part, but just wanted to confirm. :)- Did you split the
java_dedupe_definitions_v2.pkl
file from CodeSearchNet into train, test, val parts or use the entire set?- Are
<s>
</s>
<EOL>
tokens used in pre-training? if yes, how? Specifically:- If answer to 3 is yes and if you used NL-PL, how was the pre-processing done? (One way could be
<s> + nl_string + <EOL> + pl_string + </s>
)- How was the preprocessing done for PL only? One way could be
<s> + pl_string + </s>
wherepl_string
is processed so it contains<EOL>
at line breaks?
Hi,did you successfully replicate CodeGPT java or python? I also want to repre-train CodeGPT. But I don't know how to implement it. If you have replicated it, can you share the pre-train code with me?
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Code completion (both token level and line level) pipeline in CodeXGLUE
"""
from __future__ import absolute_import, division, print_function
import argparse
import glob
import logging
import os
import pickle
import random
import re
import shutil
import json
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler,TensorDataset
from torch.utils.data.distributed import DistributedSampler
from dataset import TextDataset, finetuneDataset, EvalDataset, lineDataset, CodeSearchNetDataset
from beam import Beam
try:
from torch.utils.tensorboard import SummaryWriter
except:
from tensorboardX import SummaryWriter
from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup,
BertConfig, BertForMaskedLM, BertTokenizer,
GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
RobertaConfig, RobertaForMaskedLM, RobertaTokenizer,
DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer)
from model import RNNModel
logger = logging.getLogger(__name__)
MODEL_CLASSES = {
'gpt2-xl': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
'rnn': (GPT2Config, RNNModel, GPT2Tokenizer),
'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
'distilbert': (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer)
}
def load_and_cache_examples(args, tokenizer, evaluate=False):
if args.not_pretrain:
dataset = finetuneDataset(tokenizer, args, logger, file_type='dev' if evaluate else 'train',
block_size=args.block_size)
else:
dataset = CodeSearchNetDataset(tokenizer, args, logger, file_type='dev' if evaluate else 'train',
block_size=args.block_size)
return dataset
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
def update_config(args, config):
# config.n_positions = config.n_ctx = args.block_size
config.vocab_size = args.vocab_size
def train(args, train_dataset, model, tokenizer, fh, pool):
""" Train the model """
if args.local_rank in [-1, 0]:
args.tensorboard_dir = os.path.join(args.output_dir, 'tensorboard')
if not os.path.exists(args.tensorboard_dir):
os.makedirs(args.tensorboard_dir)
tb_writer = SummaryWriter(args.tensorboard_dir)
args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.batch_size, drop_last=True)
total_examples = len(train_dataset) * (
torch.distributed.get_world_size() if args.local_rank != -1 else 1)
batch_size = args.batch_size * args.gradient_accumulation_steps * (
torch.distributed.get_world_size() if args.local_rank != -1 else 1)
# if args.max_steps > 0:
# t_total = args.max_steps
# args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
if args.num_train_epochs > 0:
t_total = total_examples // batch_size * args.num_train_epochs
"""Commented by author"""
#args.max_steps = t_total #commented by author
model.to(args.device)
if args.local_rank not in [-1, 0]:
torch.distributed.barrier()
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
'weight_decay': args.weight_decay},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
#"""hack for 71k"""
#args.learning_rate = (57228/92228)*4*0.0001
#args.learning_rate = (30880/55880)*4*(0.0001)
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps,
num_training_steps=t_total)
# """"hack for 71k interruption"""
# scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps,
# num_training_steps=57228)
checkpoint_last = os.path.join(args.output_dir, 'checkpoint-last')
scheduler_last = os.path.join(checkpoint_last, 'scheduler.pt')
optimizer_last = os.path.join(checkpoint_last, 'optimizer.pt')
"""comment the if block below for hack for 71k interruption"""
if os.path.exists(scheduler_last):
if args.dont_load_opt:
logger.warning(f"Scheduler exists at {optimizer_last}")
logger.warning("But we are not loading it")
else:
scheduler.load_state_dict(torch.load(scheduler_last, map_location="cpu"))
if os.path.exists(optimizer_last):
if args.dont_load_opt:
logger.warning(f"Optimizer exists at {optimizer_last}")
logger.warning("But we are not loading it")
else:
logger.warning(f"Loading optimizer from {optimizer_last}")
optimizer.load_state_dict(torch.load(optimizer_last, map_location="cpu"))
if args.local_rank == 0:
torch.distributed.barrier()
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank%args.gpu_per_node],
output_device=args.local_rank%args.gpu_per_node,
find_unused_parameters=True)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", total_examples )
logger.info(" Num epoch = %d", t_total*batch_size//total_examples)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", batch_size)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = args.start_step
tr_loss, logging_loss,avg_loss,tr_nb = 0.0, 0.0, 0.0, global_step
# model.resize_token_embeddings(len(tokenizer))
model.zero_grad()
set_seed(args) # Added here for reproducibility (even between python 2 and 3)
for idx in range(args.start_epoch, int(args.num_train_epochs)):
for step, batch in enumerate(train_dataloader):
inputs, labels = (batch, batch)
inputs = inputs.to(args.device)
labels = labels.to(args.device)
model.train()
outputs = model(inputs, labels=labels)
loss = outputs[0]
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
scheduler.step()
global_step += 1
output_flag=True
avg_loss=round(np.exp((tr_loss - logging_loss) /(global_step- tr_nb)),4)
if global_step % args.logging_steps == 0:
logger.info(" steps: %s ppl: %s lr: %s", global_step, round(avg_loss,5), scheduler.get_last_lr()[0])
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
# Log metrics
tb_writer.add_scalar('lr', scheduler.get_last_lr()[0], global_step)
tb_writer.add_scalar('loss', (tr_loss - logging_loss) / args.logging_steps, global_step)
logging_loss = tr_loss
tr_nb=global_step
if (args.local_rank in [-1, 0] and args.save_steps > 0 and
(global_step % args.save_steps == 0 or global_step in [t_total, t_total-1])):
checkpoint_prefix = "checkpoint"
# Save model checkpoint
if args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer, eval_when_training=True)
for key, value in results.items():
tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
logger.info(" %s = %s", key, round(value,4))
output_dir = os.path.join(args.output_dir, '{}-{}-{}'.format(checkpoint_prefix, global_step, round(results['perplexity'],4)))
else:
output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
if args.model_type == "rnn":
torch.save(model_to_save.state_dict(), os.path.join(output_dir, "model.pt"))
else:
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
# _rotate_checkpoints(args, checkpoint_prefix)
last_output_dir = os.path.join(args.output_dir, 'checkpoint-last')
if not os.path.exists(last_output_dir):
os.makedirs(last_output_dir)
if args.model_type == "rnn":
torch.save(model_to_save.state_dict(), os.path.join(last_output_dir, "model.pt"))
else:
model_to_save.save_pretrained(last_output_dir)
tokenizer.save_pretrained(last_output_dir)
idx_file = os.path.join(last_output_dir, 'idx_file.txt')
with open(idx_file, 'w', encoding='utf-8') as idxf:
idxf.write(str(0) + '\n')
torch.save(optimizer.state_dict(), os.path.join(last_output_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(last_output_dir, "scheduler.pt"))
logger.info("Saving optimizer and scheduler states to %s", last_output_dir)
step_file = os.path.join(last_output_dir, 'step_file.txt')
with open(step_file, 'w', encoding='utf-8') as stepf:
stepf.write(str(global_step) + '\n')
if args.max_steps > 0 and global_step > args.max_steps:
break
if args.max_steps > 0 and global_step > args.max_steps:
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, prefix="", eval_when_training=False):
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_output_dir = args.output_dir
eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
os.makedirs(eval_output_dir)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, drop_last=True)
# multi-gpu evaluate
if args.n_gpu > 1 and eval_when_training is False:
model = torch.nn.DataParallel(model)
# Eval!
#logger.info("***** Running evaluation {} *****".format(prefix))
#logger.info(" Num examples = %d", len(eval_dataset))
#logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
model.eval()
for batch in eval_dataloader:
inputs, labels = (batch, batch)
inputs = inputs.to(args.device)
labels = labels.to(args.device)
with torch.no_grad():
outputs = model(inputs, labels=labels)
lm_loss = outputs[0]
eval_loss += lm_loss.mean().item()
nb_eval_steps += 1
eval_loss = eval_loss / nb_eval_steps
perplexity = torch.exp(torch.tensor(eval_loss))
result = {
"perplexity": float(perplexity)
}
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
#logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
#logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
return result
def eval_acc(args, model, tokenizer, file_type='test'):
"""
Evaluate token level code completion on accuracy.
This function can only used to evaluate accuracy, but not inference, because the inputs are previous sub-tokens but not tokens.
But it can be guaranteed that the accuracy in this function is the same as the real token level completion.
The reason is:
Assuming the inputs are "context_len = 100 <EOL> masks = np . zeros (", and the ground truth is "context_len".
Due to our bpe encoding, the model have to outputs "context", "_" and "len" in 3 time step, i.e. gt0="context", gt1="_", gt2="len".
In a real inference scenario:
time step 0, inputs "context_len = 100 <EOL> masks = np . zeros ( ", model outputs: out0;
time step 1, inputs: in1=out0, outputs: out1
... until the model outputs a complete token
But in this function, no matter out0 is, in1=gt0="context".
That is to say, in this function, we feed ground truth but not output sub-token when we predict the next token which is split by bpe.
So obviouly we would get different predictions from the real token completion scenario.
However, if we calculate token leval accuracy,
if and only if the model predicts every sub-token correctly, the complete token can be seen correct.
In this situation, out0==gt0, out1==gt1, so it doesn't matter we feed gt or output to model.
In summary, this function can make models oupout the same complete token if this token equals to ground truth,
if not, the model might predict a different token from the real completion scenario, but all wrong.
So it would not affect the token level accuracy.
I use this trick to speed up evaluation due to the large test set.
"""
eval_dataset = EvalDataset(tokenizer, args, logger, file_type=file_type, block_size=args.block_size)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
model.to(args.device)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank%args.gpu_per_node],
output_device=args.local_rank%args.gpu_per_node,
find_unused_parameters=True)
model.eval()
correct = 0.0
total = 0
total_pred = []
total_gt = []
for step, batch in enumerate(eval_dataloader):
inputs = batch.to(args.device)
with torch.no_grad():
outputs = model(inputs)
pred_scores = outputs[0]
pred_ids = pred_scores.argmax(-1)
all_pred = []
all_gt = []
prev_pred = None
for pred, gt in zip(pred_ids, inputs):
pred = pred.cpu().tolist()
gt = gt.cpu().tolist()
for i, y in enumerate(gt):
if i == 0:
if y in [tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id]:
now_gt = [y]
now_pred = [0] if prev_pred is None else [prev_pred]
all_pred.append(tokenizer.decode(now_pred).strip().split()[0])
all_gt.append(tokenizer.decode(now_gt).strip())
now_gt = []
now_pred = []
else:
now_gt = [y]
now_pred = [0] if prev_pred is None else [prev_pred]
else:
if tokenizer.convert_ids_to_tokens(y)[0] == '\u0120':
if len(now_gt) > 0:
try:
all_pred.append(tokenizer.decode(now_pred).strip().split()[0])
except IndexError:
all_pred.append("<SPACE>")
all_gt.append(tokenizer.decode(now_gt).strip())
now_gt = []
now_pred = []
if y in [tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id]:
if len(now_gt) > 0:
try:
all_pred.append(tokenizer.decode(now_pred).strip().split()[0])
except IndexError:
all_pred.append("<SPACE>")
all_gt.append(tokenizer.decode(now_gt).strip())
now_gt = [y]
now_pred = [pred[i-1]]
try:
all_pred.append(tokenizer.decode(now_pred).strip().split()[0])
except IndexError:
all_pred.append("<SPACE>")
all_gt.append(tokenizer.decode(now_gt).strip())
now_gt = []
now_pred = []
continue
now_gt.append(y)
now_pred.append(pred[i-1])
assert len(all_pred) == len(all_gt)
total_pred.extend(all_pred)
total_gt.extend(all_gt)
for x, y in zip(all_pred, all_gt):
if y not in ["<s>", "</s>", "<EOL>", "<pad>"]:
total += 1
if x == y:
correct += 1
if step % args.logging_steps == 0:
logger.info(f"{step} are done!")
logger.info(f"{total}, {correct/total}")
# pickle.dump(total_pred, open(os.path.join(args.output_dir, "preds.pkl"), "wb"))
# pickle.dump(total_gt, open(os.path.join(args.output_dir, "gts.pkl"), "wb"))
saved_file = os.path.join(args.output_dir, "predictions.txt")
total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
logger.info(f"Eval on {total_samples}, saved at {saved_file}")
return total, correct
def post_process(args, preds, gts, true_gts, saved_file):
wf = open(saved_file, "w")
cnt = 0
new_gt = []
new_pred = []
for i, (pred,gt) in enumerate(zip(preds,gts)):
if gt in ["", "<pad>"]:
continue
new_gt.append(gt)
new_pred.append(pred.replace(" ", ""))
if gt == "</s>":
gt_str = " ".join(new_gt)
pred_str = " ".join(new_pred)
assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
wf.write(pred_str+"\n")
cnt += 1
new_gt = []
new_pred = []
return cnt
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--data_dir", default=None, type=str, required=True,
help="The input data path.")
parser.add_argument("--langs", default=None, type=str, required=True,
help="Languages to train, if all, train all languages in data_dir")
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model predictions and checkpoints will be written.")
## Other parameters
parser.add_argument("--model_type", default="gpt2", type=str,
help="The model architecture to be fine-tuned.")
parser.add_argument("--pretrain_dir", default="", type=str,
help="The output directory where the model predictions and checkpoints will be written.")
parser.add_argument("--config_dir", type=str,
help="config name. Required when training from scratch")
parser.add_argument("--tokenizer_dir", type=str,
help="Pre-trained tokenizer dir. Required when training from scratch")
parser.add_argument("--load_name", type=str, default="pretrained",
help="Load pretrained model name")
parser.add_argument("--mlm", action='store_true',
help="Train with masked-language modeling loss instead of language modeling.")
parser.add_argument("--mlm_probability", type=float, default=0.15,
help="Ratio of tokens to mask for masked language modeling loss")
parser.add_argument("--cache_dir", default="", type=str,
help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
parser.add_argument("--block_size", default=1024, type=int,
help="Optional input sequence length after tokenization."
"The training dataset will be truncated in block of this size for training."
"Default to the model max input length for single sentence inputs (take into account special tokens).")
parser.add_argument("--do_train", action='store_true',
help="Whether to run training.")
parser.add_argument("--do_eval", action='store_true',
help="Whether to run eval on the dev set.")
parser.add_argument("--evaluate_during_training", action='store_true',
help="Run evaluation during training at each logging step.")
parser.add_argument("--do_lower_case", action='store_true',
help="Set this flag if you are using an uncased model.")
parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument("--per_gpu_eval_batch_size", default=12, type=int,
help="Batch size per GPU/CPU for evaluation.")
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument("--learning_rate", default=5e-5, type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float,
help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float,
help="Max gradient norm.")
parser.add_argument("--num_train_epochs", default=1.0, type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--max_steps", default=-1, type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
parser.add_argument("--warmup_steps", default=0, type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument('--logging_steps', type=int, default=1000,
help="Log every X updates steps.")
parser.add_argument('--save_steps', type=int, default=5000,
help="Save checkpoint every X updates steps.")
parser.add_argument('--save_total_limit', type=int, default=None,
help='Limit the total amount of checkpoints, delete the older checkpoints in the output_dir, does not delete by default')
parser.add_argument("--eval_all_checkpoints", action='store_true',
help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
parser.add_argument("--no_cuda", action='store_true',
help="Avoid using CUDA when available")
parser.add_argument('--overwrite_output_dir', action='store_true',
help="Overwrite the content of the output directory")
parser.add_argument('--overwrite_cache', action='store_true',
help="Overwrite the cached training and evaluation sets")
parser.add_argument('--seed', type=int, default=42,
help="random seed for initialization")
parser.add_argument('--not_pretrain', action='store_true',
help="use different dataset")
parser.add_argument('--fp16', action='store_true',
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
parser.add_argument('--fp16_opt_level', type=str, default='O1',
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html")
parser.add_argument("--local_rank", type=int, default=-1,
help="For distributed training: local_rank")
parser.add_argument("--node_index", type=int, default=-1,
help="node index if multi-node running")
parser.add_argument("--gpu_per_node", type=int, default=-1,
help="num of gpus per node")
parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
parser.add_argument('--log_file', type=str, default='')
parser.add_argument('--tensorboard_dir', type=str)
parser.add_argument('--use_sfra_data_too', action='store_true',
help="use additional pretraining dataset from Weishi (which is roughly 4X what we have from CDSN, be careful in using the right cache when using this arg; loading features over-writes this decision")
parser.add_argument('--use_sfra_data_only', action='store_true',
help="ONLY use pretraining dataset from Weishi (which is roughly 4X what we have from CDSN, be careful in using the right cache when using this arg; loading features over-writes this decision")
parser.add_argument('--dont_load_opt', action='store_true',
help="do not load optimizer (if it exists from the pre-trained ckpt) | this arg doesn't matter if it doesn't exist ")
parser.add_argument('--use_NL_too', action='store_true',
help="codeGPT pre-training ignored NL part, with this arg we will include NL part (docstring) in our pretraining along with concode_field_sep (added as a special token) between NL and PL")
pool = None
args = parser.parse_args()
# args.output_dir = os.path.join(args.output_dir, args.dataset)
if args.model_type in ["bert", "roberta", "distilbert"] and not args.mlm:
raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
"flag (masked language modeling).")
if os.path.exists(args.output_dir) and os.listdir(
args.output_dir) and args.do_train and not args.overwrite_output_dir:
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir))
# Setup distant debugging if needed
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
logger.info("local_rank: %d, node_index: %d, gpu_per_node: %d"%(args.local_rank, args.node_index, args.gpu_per_node))
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend='nccl')
args.local_rank += args.node_index * args.gpu_per_node
args.n_gpu = 1
args.device = device
# args.batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
# Setup logging
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt='%m/%d/%Y %H:%M:%S',
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s, world size: %s",
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16,
torch.distributed.get_world_size() if args.local_rank != -1 else 1)
# 使用FileHandler输出到文件
fh = logging.FileHandler(args.log_file)
logger.addHandler(fh)
# Set seed
set_seed(args)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
args.start_epoch = 0
args.start_step = 0
checkpoint_last = os.path.join(args.output_dir, 'checkpoint-last')
if args.do_train and os.path.exists(checkpoint_last) and os.listdir(checkpoint_last):
args.pretrain_dir = os.path.join(checkpoint_last)
args.config_name = os.path.join(checkpoint_last, 'config.json')
idx_file = os.path.join(checkpoint_last, 'idx_file.txt')
with open(idx_file, encoding='utf-8') as idxf:
args.start_epoch = int(idxf.readlines()[0].strip()) + 1
step_file = os.path.join(checkpoint_last, 'step_file.txt')
if os.path.exists(step_file):
with open(step_file, encoding='utf-8') as stepf:
args.start_step = int(stepf.readlines()[0].strip())
logger.info("reload model from {}, resume from {} steps".format(checkpoint_last, args.start_step))
# Load pre-trained model
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
pretrained = args.pretrain_dir
if pretrained:
tokenizer = tokenizer_class.from_pretrained(pretrained, do_lower_case=args.do_lower_case, sep_token='<EOL>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', unk_token='<|UNKNOWN|>')
#added by akhilesh
special_tokens_dict = {'sep_token': "<EOL>", "bos_token":'<s>', "eos_token":'</s>',
"pad_token" : '<pad>', "unk_token":"<|UNKNOWN|>"}
if args.use_NL_too:
special_tokens_dict.update({'additional_special_tokens': ['concode_field_sep','concode_elem_sep']})
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict) #added by akhilesh
#Concode from CodeXGLUE uses concode_field_sep between NL and code and
#concode_elem_sep to separate elements within code
#We will use concode_field_sep between NL and PL when pre-training includes NL part
if args.model_type == "rnn":
model = model_class(len(tokenizer), 768, 768, 1)
model_last = os.path.join(pretrained, 'model.pt')
if os.path.exists(model_last):
logger.warning(f"Loading model from {model_last}")
model.load_state_dict(torch.load(model_last, map_location="cpu"))
else:
model = model_class.from_pretrained(pretrained)
model.resize_token_embeddings(len(tokenizer))
else:
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_dir, sep_token='<EOL>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', unk_token='<|UNKNOWN|>')
args.vocab_size = len(tokenizer)
if args.model_type == "rnn":
model = model_class(len(tokenizer), 768, 768, 1)
else:
config = config_class.from_pretrained(args.config_dir)
model = model_class(config)
model.resize_token_embeddings(len(tokenizer))
model_parameters = model.parameters()
num_params = sum([np.prod(p.size()) for p in model_parameters])
logger.info(f"Model has a total of {num_params} trainable parameters")
if args.local_rank == 0:
torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer, fh, pool)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Only works on single GPU
if args.do_eval:
# dev_total, dev_cr = eval_acc(args, model, tokenizer, 'dev')
# logger.info(f"Dev total tokens: {dev_total}, accuracy: {dev_cr/dev_total}")
test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
logger.info(f"Test total tokens: {test_total}, accuracy: {test_cr/test_total}")
if __name__ == "__main__":
main()
Thanks for your reply! Can a pre-trained model be used for code generation tasks in this way?
@skye95git In principle it can be. But given the size of this model and the fairly limited amount of data it has seen, the generation capabilities of this model will likely not be comparable to recent models like CodeX.
@skye95git In principle it can be. But given the size of this model and the fairly limited amount of data it has seen, the generation capabilities of this model will likely not be comparable to recent models like CodeX.
Yes, I found out that Codex is the model used by Copilot, built on GPT-3. But there seems to be no open source model or associated code. I don't have access to the training and evaluation of models.
Hi @celbree I'm trying to replicate CodeGPT java (adapted, so starting from openai gpt-2 checkpoint) pre-training on CodeSearchNet and would like to clarify some aspects:
TextDataset
class from CodeXGLUE/Code-Code/CodeCompletion-token/code/dataset.py and therun_lm.py
script as described here without the--not_pretrain
argument for pre-training?TextDataset
class it seems you've used all the 1.6M samples, but only PL part, but just wanted to confirm. :)java_dedupe_definitions_v2.pkl
file from CodeSearchNet into train, test, val parts or use the entire set?<s>
</s>
<EOL>
tokens used in pre-training? if yes, how? Specifically:<s> + nl_string + <EOL> + pl_string + </s>
)<s> + pl_string + </s>
wherepl_string
is processed so it contains<EOL>
at line breaks?