microsoft / graph-based-code-modelling

Code for "Generative Code Modeling with Graphs" (ICLR'19)

Where is the dataset? #3

Status: Closed (urialon closed this issue 5 years ago)

urialon commented 5 years ago

Hi @mmjb, how are you?

Is the dataset used for the ICLR'19 paper available? I always thought that it is the same one as in the ICLR'18 paper, but I just saw in the papers that the ICLR'19 one is much larger.

Thanks!

mmjb commented 5 years ago

Sorry, we didn't release the dataset here, only the tool to extract it, and we will not be able to release the dataset. Roughly, the dataset is considered to be a redistribution of the original source code, and so we would need to get approval from our legal team for each of the (hundreds of) projects we used here...

urialon commented 5 years ago

Do you think that the ICLR'18 dataset here: https://aka.ms/iclr18-prog-graphs-dataset is large enough to be useful? (useful == sensible enough to compare different models on)

(In the worst case I can create a new dataset, but then I will need to de-duplicate it, etc.)

mmjb commented 5 years ago

That dataset is specialized towards the VarMisuse task (in that it only contains subgraphs centered around a hole into which a variable should go), so I don't think that it would work for other scenarios.

What task are you trying to evaluate? Generating source code?

urialon commented 5 years ago

Yes, basically the ExprGen task as in your paper.

We're currently implementing our approach inside your data extractor (and plugging it in like the PhogExtractor to make sure our model, your model, and your baselines run on the same holes).

mmjb commented 5 years ago

I fear that's the sanest way of doing things. Sorry to not be more helpful here, but the legal requirements around a global megacorporation sometimes make certain things surprisingly hard...

urialon commented 5 years ago

Yes, I understand. Can we use the ICLR'18 dataset at all, as a temporary solution? Is it in raw ".cs" files or a specific preprocessed format?

mmjb commented 5 years ago

It's in preprocessed JSON, essentially the output of the VarMisuse variant of the Extractor...

urialon commented 5 years ago

Maybe you can release that dataset as raw files? So I can run my extractor on the same raw data?

mmjb commented 5 years ago

@mallamanis, you dealt with the dataset release -- can we do that?

mallamanis commented 5 years ago

I can try to get to this, but I will need a few days first...

@urialon if you want this sooner, it might be faster to do the following (and I will eventually do the same thing, since we've deleted all other data by now):

a) Appendix D of the ICLR18 paper lists the projects along with the git SHA we used.
b) Clone these projects and set the HEAD to that SHA.
c) The files in these repositories are the ones used in the extracted data.
d) For additional filtering, each entry in the .json has information about the file (filename field) from which the data was extracted. Keep only the files that appear in any of the .jsons.

(if you do this, let us know)
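Steps a) to d) can be sketched roughly as follows. This is an illustrative sketch only, not the gist linked later in the thread. The function names are hypothetical, and it assumes that each .json file holds a JSON array of objects with a `filename` field, and that this field stores a path relative to the directory holding the clones. The released files may be packaged differently (for example compressed, or with another path prefix), so the loading and matching logic would need adjusting.

```python
import json
import subprocess
from pathlib import Path


def checkout_at_sha(repo_url: str, sha: str, dest: Path) -> None:
    """Clone repo_url into dest and move HEAD to the given commit (step b)."""
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", sha], check=True)


def used_filenames(json_dir: Path) -> set:
    """Collect the `filename` field of every entry in every .json file (step d)."""
    names = set()
    for json_path in sorted(json_dir.glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            entries = json.load(f)  # assumption: a JSON array of objects
        for entry in entries:
            # Normalize Windows-style separators so paths compare equal.
            names.add(entry["filename"].replace("\\", "/"))
    return names


def filter_sources(clones_root: Path, used: set) -> list:
    """Return the .cs files under clones_root whose relative path is in `used`."""
    kept = []
    for path in sorted(clones_root.rglob("*.cs")):
        rel = path.relative_to(clones_root).as_posix()
        if rel in used:
            kept.append(path)
    return kept
```

A typical run would call `checkout_at_sha` once per project and SHA from Appendix D, then pass the directory of clones and the result of `used_filenames` to `filter_sources`.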

urialon commented 5 years ago

That's great! Thanks guys!

urialon commented 5 years ago

Hey @mallamanis, I did what you suggested.

Can you please approve the following script? It does work, I just want to verify that I am taking the right graphs, jsons, repos, etc. https://gist.github.com/urialon/bae095ebd86a0411ee97883dfcb5ae5b

Sorry for the trouble, I just figured that it would be easier for you to review my code than to write it yourself.

Thanks!

mallamanis commented 5 years ago

This looks great! And I believe that the script is correct.

Hope this helps... Sorry for making things so complicated. Releasing/redistributing code of various licenses requires a lot of legal effort for any company and some open-source licenses make things even harder.

Let us know if we can be of more help :)

urialon commented 5 years ago

Yes, thanks, it helps a lot!