salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Reading data for code generation. #27

Closed BakingBrains closed 2 years ago

BakingBrains commented 2 years ago

For the code generation task, should I use the data reading method used for Concode?

import json

def read_concode_examples(filename, data_num):
    """Read (nl, code) examples from a Concode-style JSONL file."""
    examples = []

    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(
                Example(
                    idx=idx,
                    source=x["nl"].strip(),
                    target=x["code"].strip()
                )
            )
            # enumerate already advances idx; stop once data_num examples are read
            if idx + 1 == data_num:
                break
    return examples
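For reference, both readers assume an `Example` container holding the (source, target) pair; in the CodeT5 repo it is defined in `_utils.py`. A minimal stand-in (a sketch for running these snippets in isolation, not the repo's exact class) would be:

```python
from dataclasses import dataclass

@dataclass
class Example:
    """One training pair: an index, a natural-language source, and a code target."""
    idx: int
    source: str
    target: str
```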

Or should I use the data reading method for code summarization (here replacing source with the docstring_tokens and target with code_tokens)?

import json

def read_summarize_examples(filename, data_num):
    """Read examples from filename."""
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            js = json.loads(line)
            if 'idx' not in js:
                js['idx'] = idx
            code = ' '.join(js['code_tokens']).replace('\n', ' ')
            code = ' '.join(code.strip().split())
            nl = ' '.join(js['docstring_tokens']).replace('\n', '')
            nl = ' '.join(nl.strip().split())
            examples.append(
                Example(
                    idx=idx,
                    source=nl,
                    target=code,
                )
            )
            if idx + 1 == data_num:
                break
    return examples

Any suggestions?

Also, do I need to change args.max_source_length = 256 and args.max_target_length = 128 for the code generation task?

yuewang-cuhk commented 2 years ago

Hi, the data reading functions can certainly be customized to your needs. If you want to fine-tune on the Concode code generation task, you can employ the former, read_concode_examples. If you want to reverse the CodeSearchNet summarization task into a text-to-code generation task, you can modify read_summarize_examples.
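To make the second option concrete, here is a small self-contained round trip (the record contents and temp file are made up for illustration): write one CodeSearchNet-style JSONL record, then read it back so the docstring becomes the source and the code becomes the target, in the text-to-code direction.

```python
import json
import tempfile

# One made-up CodeSearchNet-style record with code_tokens / docstring_tokens fields
record = {
    "code_tokens": ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"],
    "docstring_tokens": ["Add", "two", "numbers", "."],
}

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps(record) + "\n")
    path = f.name

pairs = []
with open(path, encoding="utf-8") as f:
    for idx, line in enumerate(f):
        js = json.loads(line)
        # text-to-code direction: docstring -> source, code -> target
        source = " ".join(" ".join(js["docstring_tokens"]).split())
        target = " ".join(" ".join(js["code_tokens"]).split())
        pairs.append((idx, source, target))

print(pairs[0][1])  # "Add two numbers ."
print(pairs[0][2])  # "def add ( a , b ) : return a + b"
```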

For the maximum source/target lengths, these are usually determined by the tokenized lengths of your (source, target) pairs and, in some cases, by GPU memory limits. You can tune these hyper-parameters as well.
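One rough way to pick those lengths from your data is to look at a high percentile of the token-length distribution over your pairs. A sketch (tokenize here is a whitespace stand-in; with CodeT5 you would tokenize with the model's pretrained tokenizer instead, so the numbers below are only illustrative):

```python
def percentile_length(texts, tokenize, pct=95):
    """Return the pct-th percentile token length over texts (nearest-rank)."""
    lengths = sorted(len(tokenize(t)) for t in texts)
    # nearest-rank index into the sorted lengths
    k = min(len(lengths) - 1, round(pct / 100 * (len(lengths) - 1)))
    return lengths[k]

# Hypothetical pairs; in practice, run this over your whole training set
sources = ["concat two strings", "sort a list of integers in descending order"]
targets = ["return a + b", "return sorted ( xs , reverse = True )"]
print(percentile_length(sources, str.split))  # candidate for max_source_length
print(percentile_length(targets, str.split))  # candidate for max_target_length
```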