salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Code similarity CodeT5-large/small #134

Open lyriccoder opened 1 year ago

lyriccoder commented 1 year ago

I am interested in using your CodeT5 model for code similarity tasks and have a question about test-time usage when comparing just two code snippets. With the CodeXGLUE dataset format, the model takes a list of code snippets and returns the top-n examples most similar to a given query. Is there a way to check the similarity between two specific snippets instead? I would like to obtain either a similarity score (such as a probability) or a binary output (0 or 1) indicating whether the two snippets are similar or different. For instance, given the following two code snippets:

public void foo() { System.out.println("Hi"); }
protected DecryptedEndPoint newDecryptedEndPoint()
    {
        return new DecryptedEndPoint();
    }

Can your model provide insights into their similarity or equivalence?

yuewang-cuhk commented 1 year ago

Hi there, to measure code similarity, I recommend using the CodeT5+ 110M embedding model to extract an embedding for each snippet and then computing the similarity between the two embeddings, e.g., cosine similarity.
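A minimal sketch of this suggestion, not the maintainers' official recipe. It assumes the Hugging Face checkpoint `Salesforce/codet5p-110m-embedding` and that `model(input_ids)[0]` returns a single embedding vector for the snippet; check both against the model card before relying on them. The helper names (`load_model`, `embed`, `is_similar`) and the 0.8 threshold are hypothetical; a real threshold would have to be tuned on labeled pairs.

```python
import math
from typing import List, Sequence

# Assumed checkpoint name on the Hugging Face Hub.
CHECKPOINT = "Salesforce/codet5p-110m-embedding"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def load_model(checkpoint: str = CHECKPOINT):
    """Load tokenizer and model (downloads weights on first use)."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
    model.eval()
    return tokenizer, model


def embed(code: str, tokenizer, model) -> List[float]:
    """Embed one snippet; input is truncated to 512 tokens.

    Assumption: model(input_ids)[0] is a 1-D embedding tensor.
    """
    import torch  # imported lazily

    input_ids = tokenizer.encode(
        code, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        return model(input_ids)[0].tolist()


def is_similar(score: float, threshold: float = 0.8) -> int:
    """Binary decision from a score; the threshold is a placeholder to tune."""
    return 1 if score >= threshold else 0
```

To compare the two snippets from the question, call `load_model()` once, pass each snippet through `embed`, and feed the two vectors to `cosine_similarity`. The result is a score rather than a probability; `is_similar` only turns it into the 0/1 output asked for above.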

liying-sf commented 7 months ago

Hi, the CodeT5+ 110M embedding model has an input limit of 512 tokens. Is there any way to increase this limit? I would appreciate any advice.
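The thread does not answer this question. One common workaround, which does not raise the model's own limit, is to split a long input into overlapping windows that each fit within 512 tokens, embed every window, and average the results. The sketch below illustrates that idea only; `embed_window` is a hypothetical callable that maps a list of token ids to an embedding vector, and the default window of 510 assumes two positions are reserved for special tokens.

```python
import math
from typing import Callable, List, Sequence


def split_into_windows(
    token_ids: Sequence[int], window: int, stride: int
) -> List[List[int]]:
    """Split token ids into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the input
        start += stride
    return windows


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average vectors element-wise, then L2-normalize the mean."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0.0 else mean


def embed_long(
    token_ids: Sequence[int],
    embed_window: Callable[[List[int]], Sequence[float]],  # hypothetical
    window: int = 510,   # assumed: 512 minus two special tokens
    stride: int = 256,   # overlap between consecutive windows
) -> List[float]:
    """Embed an arbitrarily long token sequence by pooling window embeddings."""
    windows = split_into_windows(token_ids, window, stride)
    return mean_pool([embed_window(w) for w in windows])
```

Averaging window embeddings loses some cross-window context, so the pooled vector is only an approximation of what a longer-context model would produce.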