Closed by urialon 5 years ago
Sorry, we didn't release the dataset here, only the tool to extract it, and we will not be able to release the dataset. Roughly, the dataset is considered to be a distribution of the original source code, and so we would need to get approval from our legal team for each of the (hundreds of) projects we used here...
Do you think that the ICLR'18 dataset here: https://aka.ms/iclr18-prog-graphs-dataset is large enough to be useful? (useful == sensible enough to compare different models on)
(in the worst case I can create a new dataset, but then I will need to de-duplicate etc.)
That dataset is specialized towards the VarMisuse task (in that it only contains subgraphs centered around a hole into which a variable should go), so I don't think that it would work for other scenarios.
What task are you trying to evaluate? Generating source code?
Yes, basically the ExprGen task as in your paper.
We're currently implementing our approach inside your data extractor (and plugging it in like the PhogExtractor to make sure our model, your model, and your baselines run on the same holes).
I fear that's the sanest way of doing things. Sorry to not be more helpful here, but the legal requirements around a global megacorporation sometimes make certain things surprisingly hard...
Yes, I understand. Can we use the ICLR'18 dataset at all, as a temporary solution? Is it in raw ".cs" files or a specific preprocessed format?
It's in preprocessed JSON, essentially the output of the spin of the Extractor for VarMisuse...
Maybe you can release that dataset as raw files? So I can run my extractor on the same raw data?
@mallamanis, you dealt with the dataset release -- can we do that?
I can try to get to this, but I will need a few days before I do...
@urialon if you want this sooner, it might be faster to do the following (and I will eventually do the same thing, since we've deleted all other data by now):
a) Appendix D of the ICLR'18 paper lists the projects along with the git SHA we used for each.
b) Clone these projects and set the HEAD to that SHA.
c) The files in these repositories are the ones used in the extracted data.
d) For additional filtering, each entry in the .json files has information about the file (the filename field) that the data was extracted from. Keep only the files that appear in any of the .json files.
(if you do this, let us know)
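Steps a) to d) above could be sketched roughly as follows. This is not the script from this thread, only a minimal illustration under several assumptions: that the extracted data sits in gzip-compressed files each holding one JSON array, that every entry has a filename field (as described in step d), and that those filename values are paths relative to the repository root with forward slashes. The function names and directory layout are hypothetical.

```python
import gzip
import json
import subprocess
from pathlib import Path


def checkout_at_sha(clone_url, sha, dest):
    """Steps a) and b): clone a project and move HEAD to the given SHA."""
    subprocess.run(["git", "clone", clone_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", sha], check=True)


def collect_used_filenames(graphs_dir):
    """Step d): gather the 'filename' value of every entry in the *.gz files."""
    used = set()
    for gz_path in sorted(Path(graphs_dir).rglob("*.gz")):
        with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
            entries = json.load(handle)  # assumption: one JSON array per file
        for entry in entries:
            used.add(entry["filename"])
    return used


def filter_sources(repo_root, used_filenames):
    """Steps c) and d): keep only the *.cs files that the extracted data names."""
    repo_root = Path(repo_root)
    kept = []
    for cs_path in sorted(repo_root.rglob("*.cs")):
        # assumption: filenames in the JSON are relative to the repo root
        relative = cs_path.relative_to(repo_root).as_posix()
        if relative in used_filenames:
            kept.append(cs_path)
    return kept
```

If the filename values turn out to be absolute paths or to carry a project prefix, the comparison in filter_sources would need to be adjusted accordingly.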
That's great! Thanks guys!
Hey @mallamanis, I did what you suggested.
Can you please approve the following script? It does work, I just want to verify that I am taking the right graphs, jsons, repos, etc. https://gist.github.com/urialon/bae095ebd86a0411ee97883dfcb5ae5b
*.cs files. When the repo was in the dataset but not in the paper (actually this was just OpenLiveWriter), I took the latest commit. Is there a SHA for that repo?
*.gz files in the graphs dir.
Sorry for the trouble, I just figured that it would be easier for you to review my code than to write it yourself.
Thanks!
This looks great! And I believe that the script is correct.
Hope this helps... Sorry for making things so complicated. Releasing/redistributing code of various licenses requires a lot of legal effort for any company and some open-source licenses make things even harder.
Let us know if we can be of more help :)
Yes, thanks, it helps a lot!
Hi @mmjb, how are you?
Is the dataset used for the ICLR'19 paper available? I always thought that it is the same one as in the ICLR'18 paper, but I just saw in the papers that the ICLR'19 one is much larger.
Thanks!