tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License

Pretrained model for Python #134

Avv22 closed this issue 3 years ago

Avv22 commented 3 years ago

Hello,

I have a bunch of Python ASTs and Java ASTs in the following format:

[{"id": 0, "type": "Module", "children": [1, 7, 19, 22, 38]}, {"id": 1, "type": "Assign", "children": [2, 3]}, {"id": 2, "type": "NameStore", "value": "S"}, {"id": 3, "type": "Call", "children": [4, 5]}, {"id": 4, "type": "NameLoad", "value": "list"}, {"id": 5, "type": "Call", "children": [6]}, {"id": 6, "type": "NameLoad", "value": "input"}, {"id": 7, "type": "Assign", "children": [8, 9]}, {"id": 8, "type": "NameStore", "value": "a"}, {"id": 9, "type": "Call", "children": [10, 11]}, {"id": 10, "type": "NameLoad", "value": "list"}, {"id": 11, "type": "Call", "children": [12, 13, 14]}, {"id": 12, "type": "NameLoad", "value": "map"}, {"id": 13, "type": "NameLoad", "value": "int"}, {"id": 14, "type": "Call", "children": [15]}, {"id": 15, "type": "AttributeLoad", "children": [16, 18]}, {"id": 16, "type": "Call", "children": [17]}, {"id": 17, "type": "NameLoad", "value": "input"}, {"id": 18, "type": "attr", "value": "split"}, {"id": 19, "type": "Assign", "children": [20, 21]}, {"id": 20, "type": "NameStore", "value": "factor"}, {"id": 21, "type": "Num", "value": "0"}, {"id": 22, "type": "For", "children": [23, 24, 25]}, {"id": 23, "type": "NameStore", "value": "tmp"}, {"id": 24, "type": "NameLoad", "value": "a"}, {"id": 25, "type": "body", "children": [26, 35]}, {"id": 26, "type": "Expr", "children": [27]}, {"id": 27, "type": "Call", "children": [28, 31, 34]}, {"id": 28, "type": "AttributeLoad", "children": [29, 30]}, {"id": 29, "type": "NameLoad", "value": "S"}, {"id": 30, "type": "attr", "value": "insert"}, {"id": 31, "type": "BinOpAdd", "children": [32, 33]}, {"id": 32, "type": "NameLoad", "value": "tmp"}, {"id": 33, "type": "NameLoad", "value": "factor"}, {"id": 34, "type": "Str", "value": "\\""}, {"id": 35, "type": "AugAssignAdd", "children": [36, 37]}, {"id": 36, "type": "NameStore", "value": "factor"}, {"id": 37, "type": "Num", "value": "1"}, {"id": 38, "type": "Expr", "children": [39]}, {"id": 39, "type": "Call", "children": [40, 41]}, {"id": 40, "type": 
"NameLoad", "value": "print"}, {"id": 41, "type": "Call", "children": [42, 45]}, {"id": 42, "type": "AttributeLoad", "children": [43, 44]}, {"id": 43, "type": "Str", "value": ""}, {"id": 44, "type": "attr", "value": "join"}, {"id": 45, "type": "NameLoad", "value": "S"}]

How can I get embeddings for these ASTs with your model? Is there an already trained model that I can use directly to output embeddings, similar to your trained model for Java, or should I train the model from scratch for Python? If I need to train from scratch, can you please show how to get started?

urialon commented 3 years ago

Hi Avra, Thank you for your interest in this work! Sorry again for the delayed response.

Yes, you will need to train the model from scratch for Python. See: https://github.com/tech-srl/code2vec#extending-to-other-languages

As for Java, you have two options: either extract paths from your ASTs in the same format as our data, or de-serialize your ASTs (convert them back to code) and run our JavaExtractor on the produced code.
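For readers following the first option, here is a rough sketch of what "extracting paths" from a flat node list like the one posted above could look like. It is not the repository's extractor. It assumes the id/type/children/value layout from the question, treats every childless node with a non-empty value as a leaf token, and writes contexts as `token,path,token`, which is how the README describes the input format as far as I understand it. The `^` (up) and `_` (down) separators, the `max_length` parameter, and all function names are illustrative choices, and a real pipeline would also need a target label per example and a vocabulary compatible with the model.

```python
import json
import re
from itertools import combinations


def load_nodes(ast_json):
    """Index the flat node list by id and record each node's parent id."""
    nodes = {n["id"]: n for n in json.loads(ast_json)}
    parent = {}
    for node in nodes.values():
        for child_id in node.get("children", []):
            parent[child_id] = node["id"]
    return nodes, parent


def path_to_root(node_id, parent):
    """Return the ids from node_id up to the root, inclusive."""
    path = [node_id]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def clean_token(value):
    """Contexts may not contain spaces or commas, so replace them."""
    return re.sub(r"[\s,]+", "_", value)


def path_contexts(ast_json, max_length=8):
    """Enumerate leaf-to-leaf contexts as 'left_token,path,right_token'."""
    nodes, parent = load_nodes(ast_json)
    # Leaves: nodes without children that carry a non-empty value.
    leaves = [i for i, n in nodes.items()
              if not n.get("children") and n.get("value")]
    contexts = []
    for a, b in combinations(leaves, 2):
        up, down = path_to_root(a, parent), path_to_root(b, parent)
        # Drop shared ancestors above the lowest common ancestor (LCA).
        while len(up) > 1 and len(down) > 1 and up[-2] == down[-2]:
            up.pop()
            down.pop()
        # up ends at the LCA; down[:-1] excludes it to avoid counting it twice.
        if len(up) + len(down) - 1 > max_length:
            continue
        up_types = [nodes[i]["type"] for i in up]
        down_types = [nodes[i]["type"] for i in reversed(down[:-1])]
        path = "^".join(up_types) + "".join("_" + t for t in down_types)
        left = clean_token(nodes[a]["value"])
        right = clean_token(nodes[b]["value"])
        contexts.append(f"{left},{path},{right}")
    return contexts


if __name__ == "__main__":
    # A three-node tree for `x = 1`, in the same layout as the posted AST.
    sample = ('[{"id": 0, "type": "Assign", "children": [1, 2]}, '
              '{"id": 1, "type": "NameStore", "value": "x"}, '
              '{"id": 2, "type": "Num", "value": "1"}]')
    print(" ".join(path_contexts(sample)))
```

Note that paths produced this way use Python node types, so they would not match the vocabulary of the released Java model; they are only useful for training a new model.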

Best, Uri

Avv22 commented 3 years ago

@urialon Okay! So the AST sample above, for a single Python program, will not work as-is? Do I have to run your Java extractor and astminer on my code samples in order to train code2vec on them?

urialon commented 3 years ago

Correct.