How to create code2vec input

tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"

https://code2vec.org

MIT License

How to create code2vec input #186

Open messiGao opened 1 year ago

messiGao commented 1 year ago

I run a command like `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but I get this error: `return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record [[{{node IteratorGetNext}}]]`.

urialon commented 1 year ago

Hi @messiGao , Thank you for your interest in our work.

I think there is some confusion: the exception is raised by TensorFlow, whereas the java command you mentioned does not involve TensorFlow at all.
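For reference, the field count in the error message can be reproduced without TensorFlow by inspecting the file passed to --test. The sketch below is not part of code2vec. It assumes the reader splits each line on single spaces and expects one method name followed by 200 path-contexts (hence 201 fields), which is an inference from the error message rather than something confirmed in this thread; `count_fields` and `summarize` are hypothetical helper names.

```python
from collections import Counter

EXPECTED_FIELDS = 201  # assumption: 1 method name + 200 path-contexts


def count_fields(line: str) -> int:
    """Number of space-delimited fields in one record (empty fields count too)."""
    return len(line.rstrip("\n").split(" "))


def summarize(lines):
    """Map each observed field count to the number of lines that have it."""
    return Counter(count_fields(line) for line in lines)
```

Calling `summarize(open("file.txt", encoding="utf-8"))` on raw extractor output would be expected to show small, varying counts such as the 4 reported above, rather than a uniform 201.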

May I also ask what kind of task you are working on? Maybe I can recommend a newer model.

Best, Uri

messiGao commented 1 year ago

I want to use the "--test" option to export a .vectors file, but I don't know what kind of TEST_FILE is correct. When I asked GPT-4, the answer was to use the JavaExtractor to convert my test.java to test.txt.

messiGao commented 1 year ago

Additionally, my aim is to store a Java codebase in a vector database so that I can run similarity searches and retrieve the code files relevant to my query.
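The retrieval step described here can be sketched independently of any particular embedding model or vector database. The example below is a minimal in-memory stand-in, assuming each file has already been turned into a fixed-length vector by some model; `cosine_similarity` and `top_k` are hypothetical helper names, and a real vector database would replace the linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (path, score) pairs most similar to the query vector.

    `index` maps a file path to its embedding vector.
    """
    scored = [(path, cosine_similarity(query, vec)) for path, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The query text has to be embedded with the same model as the indexed files for the scores to be comparable.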

urialon commented 1 year ago

Hi @messiGao ,

Please see https://github.com/neulab/code-bert-score. You don't need the approach itself, but the project provides Huggingface models, including one specifically for Java called neulab/codebert-java.

This will allow you to use the Huggingface library with that model and a BERT-like framework.
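As a rough sketch of what that could look like, the function below loads the model through the `transformers` library and returns one vector per snippet. Mean pooling over the last hidden state is an assumption made for this example, not a method recommended anywhere in this thread, and `embed_java` is a hypothetical helper name. It requires `torch` and `transformers` to be installed and downloads the model on first use.

```python
def embed_java(snippets, model_name="neulab/codebert-java"):
    """Return one mean-pooled vector (list of floats) per Java snippet."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        snippets, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, hidden)

    # Mean pooling over real tokens only; padding positions are masked out.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled.tolist()
```

For example, `embed_java(["int add(int a, int b) { return a + b; }"])` would return a list containing a single vector. Inputs longer than 512 tokens are truncated here, so whole files may need to be split into methods first.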

Best, Uri

asyed79gatech commented 9 months ago

I have a similar dilemma with regard to creating embeddings of C# code using a code2vec model I have trained. As @messiGao mentioned, I want to use the "--test" option to create a .vectors file, as described in the repo, but when I execute the command it gives the following error:


```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 2 in record
         [[node IteratorGetNext (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]]
```

urialon commented 9 months ago

Hi @asyed79gatech , Thank you for your interest in our work.

I believe that you haven't run the preprocess.sh script on the data.
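To illustrate why the raw extractor output fails, the sketch below pads or truncates each record to a fixed number of fields. It is not the repository's preprocessing code: it assumes a record is `name ctx1 ctx2 ...` separated by single spaces, that 201 means one name plus 200 path-contexts, and that missing contexts are represented by empty fields. `pad_record` is a hypothetical helper name.

```python
MAX_CONTEXTS = 200  # assumption: inferred from the "201 fields" in the error


def pad_record(raw_line, max_contexts=MAX_CONTEXTS):
    """Turn 'name ctx1 ctx2 ...' into a line with exactly max_contexts + 1 fields."""
    name, *contexts = raw_line.strip().split(" ")
    contexts = contexts[:max_contexts]  # naive truncation, no sampling
    padding = " " * (max_contexts - len(contexts))  # one empty field per missing context
    return " ".join([name] + contexts) + padding
```

This only reshapes the records. The preprocess.sh script may do more than this, such as sampling contexts and building vocabulary files, so running the script itself remains the supported route.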

In general, however, I recommend using the newer https://github.com/neulab/code-bert-score project. It is based on Huggingface, which is actively maintained.

Best, Uri

asyed79gatech commented 9 months ago

Hi @urialon

Thanks for your prompt response. I thought we only needed to run the preprocess.sh script when training the code2vec model. I already have a trained, released model and want to use it to generate embeddings for a vector store.

XuPing1234 commented 4 months ago

> I use a command like `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but I get the error `return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record [[{{node IteratorGetNext}}]]`.

Hello, have you resolved your issue? How can Java source code be converted into the input format required by code2vec?

zhaojialinnn commented 3 months ago

> I use a command like `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but I get the error `return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record [[{{node IteratorGetNext}}]]`.
>
> Hello, have you resolved your issue? How can Java source code be converted into the input format required by code2vec?

Hello, I encountered the same issue. Have you resolved it?