microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License

Raw programs for the VarMisuse task #10

ajayjain closed this issue 4 years ago

ajayjain commented 4 years ago

This is a great project! I'd like to run experiments on the VarMisuse task, and ideally on other related tasks like variable naming. The training process works fine for me, but how can I access the raw C# programs used to create the dataset? Alternatively, is there code for creating the graphs from source C# programs? I'd like to construct different graph structures, e.g. by preprocessing the programs, performing program analyses etc.

I tried to reconstruct the programs using the 'ContextGraph' property on samples in the dataset, but the programs don't seem to be correct (e.g. only 2 blocks are closed for an Akka program).

In [43]: program = ""
    ...: for u, v in raw_sample['ContextGraph']['Edges']['NextToken']:
    ...:     label = raw_sample['ContextGraph']['NodeLabels'][str(u)]
    ...:     program += label
    ...:     if label == ";":
    ...:         program += "\n"
    ...: program += raw_sample['ContextGraph']['NodeLabels'][str(v)]
    ...: print(program)
(,value)<SLOT>valuevaluevaluekvkvkvkv_clusterbucketvbucketkeyv_cluster_cluster_cluster_cluster_clusterbucketbucketbucketvkeykeyContext_cluster_cluster_cluster_cluster_cluster_cluster_cluster_cluster_cluster_clusterbucketValueHolder,v,=vvar,)vvdeltaContent=bucketkvkvcurrentkvcurrentkvkv>&&=entryvar.entryentry.entryentryentry.=>entry(.)(=>kv.Count;
Sender.SenderTell(count)countvar=_registry._registry_registry_registrySum{=bucketbucketvar{)foreach(varentryin_registrykvin.{_registryvartopicPrefix=Self.SelfSelf=Key;
...

Thanks!

mmjb commented 4 years ago

The raw C# programs for the VarMisuse task are not included in the dataset because of licensing concerns (releasing them is not impossible, but we as Microsoft would need to go through a lot of legal double-checking before doing so). The included graphs are actually truncated subgraphs of the full program graphs, so there is no reliable way to reconstruct the programs from them.

Getting to the raw programs is therefore a bit involved. In the paper presenting the dataset (https://arxiv.org/pdf/1711.00740.pdf), Table 4 (in the appendix) lists all the projects used, along with the SHA of the commit we used for the extraction. Furthermore, https://github.com/microsoft/graph-based-code-modelling has a related data extractor (derived from the one we used for the VarMisuse paper), which should help you extract similar data and would be a starting point for adding more analyses.
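
As a side note on why reconstruction from the released samples falls short: the loop in the question visits the `NextToken` edges in the order they are stored, which may not be token order. A minimal sketch that instead follows the edges as chains is below. It assumes only the sample layout shown in the question (`ContextGraph` with `NodeLabels` keyed by stringified node id, and `Edges['NextToken']` as `(u, v)` pairs) and that each node has at most one `NextToken` successor; the helper name `token_fragments` is hypothetical. Because the graphs are truncated, expect disjoint fragments rather than a complete program.

```python
def token_fragments(context_graph):
    """Follow NextToken edges as chains and return one token list per chain.

    Assumes the layout used in the question: NodeLabels is keyed by str(node id)
    and Edges['NextToken'] is a list of (source, target) pairs.
    """
    labels = context_graph["NodeLabels"]
    edges = context_graph["Edges"]["NextToken"]

    # Map each node to its successor (assumption: at most one per node).
    successor = {u: v for u, v in edges}
    targets = {v for _, v in edges}

    # A chain starts at a node that has a successor but no predecessor.
    starts = sorted(u for u in successor if u not in targets)

    fragments = []
    for start in starts:
        tokens = []
        seen = set()
        node = start
        # Walk the chain until it ends; `seen` guards against cycles.
        while node is not None and node not in seen:
            seen.add(node)
            tokens.append(labels[str(node)])
            node = successor.get(node)
        fragments.append(tokens)
    return fragments
```

Calling `token_fragments(raw_sample['ContextGraph'])` would give one token list per chain. Whitespace and anything outside the truncated subgraph are still lost, so this is only useful for inspecting samples, not for recovering source files.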