tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License

Dataset format #96

Closed celsofranssa closed 4 years ago

celsofranssa commented 4 years ago

I am starting to work with code2vec and wondering about the dataset format.

In the following instance: test|field|injection|with|map bf,-400155226,testbean size,-1639730666,assertequals ... I imagine that test|field|injection|with|map is the method name split into subtokens.

Then, what is the sequence of triples that comes right after the method name?

urialon commented 4 years ago

Hi @Ceceu, thank you for your interest in code2vec!

Yes, test|field|injection|with|map is the method name, and each comma-separated triple after it is a single path together with the two values it connects. So, for example, in bf,-400155226,testbean the number -400155226 is the hash of the path of nodes that connects the values bf and testbean.

See also the description here: https://github.com/tech-srl/code2vec#extending-to-other-languages

Best, Uri
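
The format described above can be read with a few lines of Python. This is a minimal sketch, not code from the code2vec repository: the names `PathContext` and `parse_example` are hypothetical, and it assumes only what the thread states, namely that the method name comes first with its subtokens joined by `|`, followed by space-separated triples of the form value,path-hash,value. Only the two triples visible in the question are used; the rest of that line is elided in the original.

```python
from typing import List, NamedTuple, Tuple


class PathContext(NamedTuple):
    """One triple from a dataset line (hypothetical helper type)."""
    source: str     # value at one end of the path, e.g. "bf"
    path_hash: str  # hashed path, kept as a string symbol, e.g. "-400155226"
    target: str     # value at the other end of the path, e.g. "testbean"


def parse_example(line: str) -> Tuple[List[str], List[PathContext]]:
    """Split one dataset line into method-name subtokens and path triples."""
    fields = line.split()
    if not fields:
        raise ValueError("empty dataset line")
    name_subtokens = fields[0].split("|")
    contexts = []
    for field in fields[1:]:
        parts = field.split(",")
        if len(parts) != 3:
            continue  # assumption: silently skip anything that is not a triple
        contexts.append(PathContext(*parts))
    return name_subtokens, contexts


line = ("test|field|injection|with|map "
        "bf,-400155226,testbean size,-1639730666,assertequals")
name, contexts = parse_example(line)
print(name)
for ctx in contexts:
    print(ctx.source, ctx.path_hash, ctx.target)
```

The hash is deliberately left as a string here, since it is only ever used as an opaque symbol and never as a number.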

celsofranssa commented 4 years ago

Hello @urialon, I am very interested in code2vec; it's a really great step forward for code understanding.

I ended up interpreting the dataset format the way you described, even though I had missed the part of the README that you pointed to (my mistake).

So, during code2vec training, is each path represented by this hash value instead of the path itself?

urialon commented 4 years ago

Yes. In code2vec, the hash is just there to save space, because the entire path is treated as a single symbol. So it doesn't matter whether we use the hash or the path itself.

In code2seq, we do not hash, because the model reads the path as a sequence of AST nodes.

I hope it helps, Uri
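
To illustrate why the hash and the raw path are interchangeable when a path is a single symbol, here is a small sketch. Everything in it is an assumption made for illustration: the thread does not say which hash function produces the values in the dataset, so the code uses a Java-style 32-bit string hash (which also yields signed integers), the path strings are made up, and `java_string_hash` and `build_vocab` are hypothetical helper names.

```python
from typing import Dict, Iterable


def java_string_hash(s: str) -> int:
    """Signed 32-bit hash in the style of Java's String.hashCode().

    Assumption: chosen only for illustration. Matches Java for characters
    in the Basic Multilingual Plane.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    if h >= 0x80000000:                      # reinterpret as signed
        h -= 0x100000000
    return h


def build_vocab(symbols: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct symbol the next free integer id."""
    vocab: Dict[str, int] = {}
    for symbol in symbols:
        vocab.setdefault(symbol, len(vocab))
    return vocab


# Made-up path strings; the real path syntax is not shown in this thread.
paths = ["Name^Call_Name", "Name^Assign_Literal", "Name^Call_Name"]
hashed = [str(java_string_hash(p)) for p in paths]

raw_vocab = build_vocab(paths)
hashed_vocab = build_vocab(hashed)

raw_ids = [raw_vocab[p] for p in paths]
hashed_ids = [hashed_vocab[h] for h in hashed]
print(raw_ids, hashed_ids)
```

Barring hash collisions, both vocabularies assign the same ids, so a model that looks up one embedding per path symbol sees identical input either way. A model that reads the path node by node, as described for code2seq, needs the unhashed path, because the individual nodes cannot be recovered from the hash.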

celsofranssa commented 4 years ago

Yes, it helped a lot. Thank you very much.