salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Maximum sequence length in the clone detection task #24

Closed h4iku closed 2 years ago

h4iku commented 2 years ago

Hi, thanks for sharing this great work.

I have a question about the maximum sequence length of the inputs of the clone detection task. In the paper, it is mentioned that the maximum source and target sequence lengths are set to be 512 and 256, respectively. However, in the clone detection task script, the src_len and trg_len are both set to 400: https://github.com/salesforce/CodeT5/blob/ad787aa6bb08ede41c94c397e577adfc7ebac39d/sh/run_exp.py#L60-L64

Then these input strings are tokenized, padded, and concatenated in convert_clone_examples_to_features, making an 800-token input: https://github.com/salesforce/CodeT5/blob/ad787aa6bb08ede41c94c397e577adfc7ebac39d/_utils.py#L69-L71
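To illustrate the behavior being asked about, here is a minimal sketch (not the repo's actual code; `concat_pair`, `max_len`, and `pad_id` are illustrative names) of padding two tokenized snippets to a fixed length and concatenating them:

```python
# Sketch: pad/truncate each snippet's token ids to max_len, then concatenate,
# giving a single input of length 2 * max_len (e.g. 400 + 400 = 800 tokens).
def concat_pair(code1_ids, code2_ids, max_len=400, pad_id=0):
    def pad(ids):
        ids = ids[:max_len]                       # truncate if too long
        return ids + [pad_id] * (max_len - len(ids))  # right-pad if too short
    return pad(code1_ids) + pad(code2_ids)

pair = concat_pair([5, 8, 13], [21, 34], max_len=4)
# pair == [5, 8, 13, 0, 21, 34, 0, 0], i.e. length 2 * max_len
```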

Could you please explain how this works considering the mentioned maximum length limit?

yuewang-cuhk commented 2 years ago

Hi, we set the maximum source and target sequence lengths to 512 and 256 for efficient pretraining. For fine-tuning, these hyperparameters can be tuned per downstream task. We set src_len and trg_len to 400 for the clone detection task with two considerations: 1) we feed the same sequence to both the encoder and the decoder, and take the last decoder state as the sequence representation (similar to BART), so we use the same maximum length for the encoder and decoder; 2) given that the average sequence length is around 320, we empirically picked 400 so that it covers most of the sequence context while keeping training efficient.
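The BART-style pooling in point 1) can be sketched as follows; this is a simplified illustration with plain lists rather than tensors, and `sequence_repr` is a hypothetical helper, not a function from the repo:

```python
# Sketch: pick the decoder hidden state at the last non-padded position
# as the whole sequence's representation (BART-style pooling).
def sequence_repr(decoder_states, attention_mask):
    """decoder_states: one hidden-state vector per position;
    attention_mask: 1 for real tokens, 0 for padding."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return decoder_states[last]

# Three positions, of which the last is padding -> state at index 1 is used.
states = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
repr_vec = sequence_repr(states, [1, 1, 0])
# repr_vec == [0.2, 0.2]
```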

As for why we concatenate code1 and code2 into an 800-token input: you can find that we reshape it back to 400 here to obtain each snippet's representation, and then combine the two representations for the clone detection prediction here.
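A minimal sketch of that reshape step (assumed behavior, not the repo's exact code; `reshape_pairs` is an illustrative name): each (2 * max_len)-token pair in the batch is split back into its two max_len-token halves so the model can encode each snippet separately before the two representations are combined for the binary clone/not-clone prediction.

```python
# Sketch: turn a batch of concatenated pairs, shape (batch, 2 * max_len),
# into a flat list of single inputs, shape (2 * batch, max_len).
def reshape_pairs(batch, max_len=400):
    singles = []
    for ids in batch:
        singles.append(ids[:max_len])  # first snippet (code1)
        singles.append(ids[max_len:])  # second snippet (code2)
    return singles

# One 4-token pair with max_len=2 becomes two 2-token inputs.
out = reshape_pairs([[1, 2, 3, 4]], max_len=2)
# out == [[1, 2], [3, 4]]
```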

h4iku commented 2 years ago

Great, thanks for the answer.

yakirba commented 1 year ago

Hi, what is the maximum context window in the latest CodeT5+ models? Thanks!