tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License
1.09k stars, 285 forks

How to create code embeddings from a Java codebase and store them in a vector database? #180

Open shankernamami opened 1 year ago

shankernamami commented 1 year ago

Hi there team code2vec,

I am working on a personal project. My aim is to store a Java codebase in a vector database to run similarity searches and retrieve code files from the db relevant to my query. Queries can be of the type:

  1. Method creating database pool connection.
  2. Entity class linked to 'Subjects' table

Basically, a query describes an activity performed by the codebase, and I should return the package, class name, and (if required) method.

My plan is to vectorize these search queries using a vectorizer present in your codebase, perform similarity search and return results.

My questions are:

  1. How can I generate vectors for Java code using your pretrained model?
  2. Will it be a good idea to vectorize an English query for similarity search?

urialon commented 1 year ago

Hi @shankernamami , Thank you for your interest in our work!

See this part of the README: https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples
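
Once vectors have been exported, a similarity search over them could be sketched as below. This is only an illustration, not part of code2vec: it assumes the exported file holds one space-separated vector of floats per line, in the same order as the input examples, and the helper names (parse_vectors, cosine, top_k) are hypothetical. A real vector database would replace the linear scan in top_k.

```python
import math
from typing import List, Sequence, Tuple


def parse_vectors(text: str) -> List[List[float]]:
    """Parse one space-separated vector of floats per non-empty line.

    Assumption: this matches the layout of the exported vectors file.
    """
    return [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (row index, similarity) for the k rows most similar to the query."""
    scored = [(index, cosine(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The row index returned by top_k can be mapped back to the package, class, and method if that metadata is stored alongside each input row. The query vector has to come from the same model as the stored vectors for the similarities to be meaningful.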

See also these newer models/papers:

Best, Uri

shankernamami commented 1 year ago

@urialon Thank you! This answers my questions :)

asyed79gatech commented 4 months ago

Hi, I used the flag indicated at the README link, "--export_code_vectors". However, doing so gives me the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 2 in record
         [[node IteratorGetNext (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]]

My command was

python3 code2vec.py --export_code_vectors --test new-data/test/AdministeredCommentsDto.cs --load models/csharp14m/saved_model_iter173.release

where "new-data/test/AdministeredCommentsDto.cs" is the path to the code snippet whose embeddings I am trying to create. I suspect I am passing the wrong type of input file. Guidance on this would be highly appreciated.

Thanks

urialon commented 4 months ago

Hi @asyed79gatech , Thank you for your interest in our work.

I believe that you haven't run the preprocess.sh script on the data.
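
The error message says the reader expects 201 fields per record but found 2, which is consistent with a raw source file being passed where preprocessed data is expected. A quick pre-flight check could be sketched as follows. It is an assumption, inferred from the error message, that each preprocessed line has one target label plus 200 context fields separated by single spaces, with missing contexts padded as empty fields; find_malformed_lines is a hypothetical helper, not part of code2vec.

```python
from typing import List, Tuple


def find_malformed_lines(text: str, max_contexts: int = 200) -> List[Tuple[int, int]]:
    """Return (line number, field count) for lines that lack the expected field count.

    Assumption: a valid line has 1 label + max_contexts fields, split on single
    spaces, so padded (empty) contexts still count as fields.
    """
    expected = max_contexts + 1
    malformed = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        field_count = len(line.split(" "))
        if field_count != expected:
            malformed.append((line_no, field_count))
    return malformed
```

Running this on the contents of the file passed to --test shows whether it has the expected layout; a raw .cs file would be reported as malformed on almost every line.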

However, in general, I recommend using the newer https://github.com/neulab/code-bert-score project. It is based on Hugging Face, which is actively maintained.

Best, Uri