microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License

VarMisuse dataset split #4

Closed: urialon closed this issue 4 years ago

urialon commented 4 years ago

Hey @mmjb, how are you?

I am experimenting with training/testing different architectures for VarMisuse, and I would like to reproduce the exact dataset from your experiments in the GNN-FiLM paper, so that the numbers are comparable.

I downloaded and decompressed the dataset into data/varmisuse/. The code assumes that there is a graphs-test subdirectory and a graphs-testonly directory, but in fact, the dataset is split by project first, and then into train/dev/test.
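That is, as far as I can tell, the layout I have vs. the layout the code expects is roughly:

data/varmisuse/&lt;project&gt;/graphs-{train,valid,test}/...    (what the zip contains)
data/varmisuse/graphs-test/, data/varmisuse/graphs-testonly/, ...    (what the code looks for)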

I guess that I can re-split according to Appendix D in your ICLR'18 paper, but there are a few inconsistencies between the Appendix and the actual dataset, for example:

  * openlivewriter appears in the dataset but not in the paper, and it only has a graphs subdir, not graphs-{train,test,valid};
  * hangfire, optikey (train) and ravendb (dev) are in the paper but not in the dataset;
  * botbuilder is marked as "train" in the paper, but its graphs-train subdir is empty (unlike the rest of the training projects).

So -

  1. Did you use the paper version of the dataset or the public version for the experiments reported in the GNN-FiLM paper?
  2. If the answer is "public", should I just re-organize the files to create the training, test, and test-only dirs? In that case, the "dev" set contains only a single project, right?
  3. Regarding file types - when I try to load a trained model and test it on a directory that contains only the file commandline.0.gz, I get an error in richpath.py: "ValueError: File suffix must be .json, .json.gz, .pkl or .pkl.gz: data/varmisuse/commandline/graphs/commandline.0.gz". Would simply renaming the files, as in the sketch below, be enough?
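A minimal sketch of what I mean (just my guess, assuming these .gz files are gzipped JSON that is merely missing the full suffix):

import glob
import os

# My guess at a quick fix: the *.gz files appear to be gzipped JSON, so give
# them the .json.gz suffix that richpath.py insists on.
for path in glob.glob("data/varmisuse/*/graphs*/*.gz"):
    if not path.endswith(".json.gz"):
        os.rename(path, path[: -len(".gz")] + ".json.gz")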

Am I missing something? Maybe it will be easiest if you could share your file structure in the data/varmisuse dir and I'll organize my files accordingly?

Thanks a lot!

mmjb commented 4 years ago

Argh, this is a bit of a mess it seems - I used the data in a storage container which I believed to be the source of the zip that @mallamanis uploaded for the paper (with a bit of moving of data), but there seem to be a bunch of discrepancies, as you've observed. What I used seems to be a mix of the two datasets...

I will try to document this going forward and create a shell script to do the reformatting, but here are some notes from my investigation so far:

Overall, I believe that the following script (run from the directory containing graph-dataset.zip) should do the right thing:

# Where the re-organised dataset should end up, and where this repo lives:
OUTDIR="reorged-varmisuse-dataset"
CODEDIR="$HOME/AZProjects/tf-gnn-samples/"
# Projects that are held out entirely, used only for the "testonly" fold:
TESTONLY_PROJS="commandline humanizer lean"

# Create the raw per-fold directories:
for fold in train valid test testonly; do
    mkdir -p "${OUTDIR}/graphs-${fold}-raw"
done

# Unpack the released dataset (per-project layout):
7za x graph-dataset.zip

# Move the test graphs of the held-out projects into the testonly fold,
# and drop the rest of their data:
for test_proj in $TESTONLY_PROJS; do
    mv graph-dataset/${test_proj}/graphs-test/* "${OUTDIR}/graphs-testonly-raw"
    rm -rf graph-dataset/${test_proj}
done

# Pool the remaining projects' graphs by fold:
for fold in train valid test; do
    mv graph-dataset/*/graphs-${fold}/* "${OUTDIR}/graphs-${fold}-raw"
done

# Give the files the .json.gz suffix that richpath.py expects:
for file in "${OUTDIR}"/*/*.gz; do
    new_file=$(echo "${file}" | sed -e 's/\.gz$/.json.gz/')
    mv "${file}" "${new_file}"
done

# Re-split into smaller JSONLines chunks and remove the raw directories:
for fold in train valid test testonly; do
    python "$CODEDIR/utils/varmisuse_data_splitter.py" "${OUTDIR}/graphs-${fold}-raw/" "${OUTDIR}/graphs-${fold}/"
    rm -rf "${OUTDIR}/graphs-${fold}-raw/"
done

I still need to verify that this yields results that are identical to what I report in the README/paper, but at least the size of the data matches fairly well, so I'm hopeful.
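(For reference, the kind of size check I mean is roughly the following; hypothetical, and assuming the splitter writes one graph per line as gzipped JSONLines with a .json.gz suffix; adjust the glob if your layout differs:)

import glob
import gzip

# Hypothetical sanity check: count datapoints per fold by counting lines in
# the split JSONLines output (one graph per line).
for fold in ["train", "valid", "test", "testonly"]:
    n = 0
    for path in glob.glob(f"reorged-varmisuse-dataset/graphs-{fold}/*.json.gz"):
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            n += sum(1 for _ in f)
    print(fold, n)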

mallamanis commented 4 years ago

Hi @urialon

Overall, we realized that the dataset originally used in the ICLR'18 paper could not be fully released (for licensing reasons) and hence we had to remove parts from the public dataset. The last table in the appendix (page 17) has a rerun on the released dataset. Having said that, as @mmjb says, the GNN-FiLM paper runs everything on the public dataset, so the results from there should be directly comparable.

urialon commented 4 years ago

Thanks a lot guys!

mmjb commented 4 years ago

[I want to keep this open to track progress on ensuring that the script above does the right thing, and then include the shell script in the repo for the next person asking this question...]

urialon commented 4 years ago

Currently the shell script works for me as expected, except that varmisuse_data_splitter.py takes hours, even on a 64-core machine.

Neuromancer42 commented 4 years ago

This script works for me too. The only minor flaw is that it seems to be single-threaded (even though I have 40 cores).

mmjb commented 4 years ago

OK, I did do some preliminary checks and things seem to work out well; this issue should be considered closed by the changes in 9a42024.

FWIW, the reason for doing things single-threaded is that a few of the original data files are several GB unpacked (and due to the poor-in-hindsight choice of storing things as JSON, they need to be loaded entirely into memory); as these also take a long while to process, parallelisation risks triggering OOMs when several of them are handled by different workers at once. What varmisuse_data_splitter.py does is (a) store things in smaller files and (b) use the JSONLines format, which allows streaming one datapoint at a time.
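(To illustrate the difference, here is a minimal sketch of streaming a gzipped JSONLines file one datapoint at a time; the filename is made up and this is not the repo's actual loader:)

import gzip
import json

# Minimal sketch: stream a gzipped JSONLines file one datapoint at a time,
# so the whole file never has to sit in memory (unlike one giant JSON array,
# which must be parsed in full before anything can be read).
def stream_datapoints(path):
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for graph in stream_datapoints("graphs-train/chunk_0000.json.gz"):
    ...  # process one graph, then let it be garbage-collected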