tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License
1.1k stars, 286 forks

Original dataset with unprocessed `.java` sources #160

Closed: abitrolly closed this issue 2 years ago

abitrolly commented 2 years ago

I downloaded the 6 GB java14m_data.tar.gz linked from README.md, and it doesn't contain the original source files needed to rebuild the model from scratch.

$ tar xzvf java14m_data.tar.gz 
data/
data/java14m/
data/java14m/java14m.val.c2v
data/java14m/java14m.test.c2v
data/java14m/java14m.train.c2v
data/java14m/java14m.dict.c2v
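
The same check can be scripted instead of reading the `tar` output by eye. This is a minimal sketch using only Python's standard `tarfile` module; the helper names `count_suffixes` and `has_java_sources` are hypothetical and not part of the code2vec repository, and the archive path is assumed to be wherever the download was saved.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_suffixes(archive_path):
    """Count regular-file members of a tar archive by file extension."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:  # members are read one at a time
            if member.isfile():  # skip directories such as data/
                counts[PurePosixPath(member.name).suffix] += 1
    return counts


def has_java_sources(archive_path):
    """Return True if the archive holds at least one .java file."""
    return count_suffixes(archive_path)[".java"] > 0
```

Calling `has_java_sources("java14m_data.tar.gz")` on an archive with the contents listed above would return `False`, since every file in that listing ends in `.c2v`.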

So is there a dataset with the original .java files?

urialon commented 2 years ago

Hi @abitrolly, thank you for your interest in our work!

The sources of this exact dataset are not available, but a very similar dataset from the code2seq paper is available here: https://github.com/tech-srl/code2vec#java-large-compressed-72gb-extracted-37gb

Best, Uri

abitrolly commented 2 years ago

Got it.