
How to build a bimodal model to search code snippets? [CodeBERTa] #5207

Closed · hmdgit closed 4 years ago

hmdgit commented 4 years ago

Hi,

I would like to build a code search engine model. The main goal is that when I pass a docstring, it should return the top-k associated code snippets.

I have data in the form of (docstring, code) pairs, meaning each docstring is associated with its corresponding code snippet.

I have seen the CodeBERTa fine-tuning code, but it does not use docstrings. Is it still possible to use this model?

Can you please give me some entry points for solving this problem with the Hugging Face library?
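
To make the goal concrete, this is roughly the kind of inference flow I have in mind (a minimal sketch; the checkpoint and the mean pooling are just placeholder choices, not a settled approach):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Sketch of the target search flow: embed all code snippets once,
# then embed a query docstring and return the top-k nearest snippets.
# Checkpoint and pooling strategy are placeholders.
tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
model = AutoModel.from_pretrained("huggingface/CodeBERTa-small-v1")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # mean pooling -> (B, H)
    return F.normalize(pooled, dim=-1)

snippets = [
    "def add(a, b): return a + b",
    "def read_file(p): return open(p).read()",
]
index = embed(snippets)                    # precomputed snippet embeddings

query = embed(["sum two numbers"])         # docstring-style query
scores = (query @ index.T).squeeze(0)      # cosine similarities
topk = scores.topk(k=2)
for score, idx in zip(topk.values.tolist(), topk.indices.tolist()):
    print(f"{score:.3f}  {snippets[idx]}")
```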

julien-c commented 4 years ago

CodeBERTa was indeed trained on code only, so you would need to tweak the approach.

Did you read the paper for CodeSearchNet (https://arxiv.org/abs/1909.09436) by @hamelsmu?
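
The core recipe there is a joint embedding: encode docstrings and code into the same vector space and train so that each docstring scores highest against its own snippet. Something like this (a rough sketch, not the paper's exact setup; the checkpoint, pooling, and temperature are my assumptions):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# In-batch softmax objective in the spirit of CodeSearchNet:
# every other code snippet in the batch acts as a negative.
tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
encoder = AutoModel.from_pretrained("huggingface/CodeBERTa-small-v1")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state         # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)         # mean pooling -> (B, H)

def contrastive_loss(docstrings, codes, temperature=0.05):
    q = F.normalize(embed(docstrings), dim=-1)          # (B, H)
    c = F.normalize(embed(codes), dim=-1)               # (B, H)
    logits = q @ c.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(len(docstrings))              # diagonal = true pairs
    return F.cross_entropy(logits, labels)
```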

hmdgit commented 4 years ago

Thanks Julien for your response.

I have skimmed the paper and its code, and I will try it.

But would it be possible to solve this with the Hugging Face BERT models? What kind of tweaks would I need to apply to the CodeBERTa fine-tuning code?

Could it be solved by fine-tuning BertForQuestionAnswering?

fengzhangyin commented 4 years ago

Maybe CodeBERT (https://arxiv.org/abs/2002.08155) is suitable for you.
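
CodeBERT is pretrained on paired natural language and code, so a docstring and a snippet can share one encoder. Loading it with transformers would look something like this (a minimal sketch; the CLS pooling is my choice, not a prescription):

```python
from transformers import AutoTokenizer, AutoModel

# CodeBERT checkpoint on the Hub, pretrained on NL-PL pairs.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# Paired input: the docstring and the code are encoded together
# in RoBERTa-style <s> text </s></s> text_pair </s> format.
inputs = tokenizer("sum two numbers",
                   "def add(a, b): return a + b",
                   return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # one vector per (docstring, code) pair
print(cls_embedding.shape)                       # torch.Size([1, 768])
```

You can also encode docstrings and code separately with the same model if you want a bi-encoder for retrieval instead of a pair classifier.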

hmdgit commented 4 years ago

This paper is of great interest to me. Is the fine-tuning source code for that paper publicly available? Or are there any short snippets that could help with fine-tuning?

fengzhangyin commented 4 years ago

You can visit this link: https://github.com/microsoft/CodeBERT.

hmdgit commented 4 years ago

Thanks for sharing. I will check it out and raise any related concerns on that GitHub repository.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.