microsoft / tf-gnn-samples

TensorFlow implementations of Graph Neural Networks
MIT License

VarMisuse dataset split #4

Closed: urialon closed this issue 4 years ago

urialon commented 4 years ago

Hey @mmjb, how are you?

I am experimenting with training/testing different architectures for VarMisuse, and I would like to reproduce the exact dataset from your experiments in the GNN-FiLM paper, so that the numbers are comparable.

I downloaded and decompressed the dataset into data/varmisuse/. The code assumes that there is a graphs-test subdirectory and a graphs-testonly directory, but in fact, the dataset is split by project first, and then into train/dev/test.
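That is, as far as I can tell, the layout I have vs. the layout the code expects is roughly:

data/varmisuse/&lt;project&gt;/graphs-{train,valid,test}/...    (what the zip contains)
data/varmisuse/graphs-test/, data/varmisuse/graphs-testonly/, ...    (what the code looks for)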

I guess that I can re-split according to Appendix D in your ICLR'18 paper, but there are a few inconsistencies between the Appendix and the actual dataset, for example:

  * openlivewriter appears in the dataset but not in the paper, and it only has a graphs subdir, not graphs-{train,test,valid};
  * hangfire, optikey (train) and ravendb (dev) are in the paper but not in the dataset;
  * botbuilder is marked as "train" in the paper, but its graphs-train subdir is empty (unlike the rest of the training projects).

So -

  1. Did you use the paper version of the dataset or the public version for the experiments reported in the GNN-FiLM paper?
  2. If the answer is "public", should I just re-organize the files to create the training, test, and test-only dirs? In that case, the "dev" set contains only a single project, right?
  3. Regarding file types - when I try to load a trained model and test it on a directory that contains only the file commandline.0.gz, I get an error in richpath.py: "ValueError: File suffix must be .json, .json.gz, .pkl or .pkl.gz: data/varmisuse/commandline/graphs/commandline.0.gz". Would simply renaming the files, as in the sketch below, be enough?
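A minimal sketch of what I mean (just my guess, assuming these .gz files are gzipped JSON that is merely missing the full suffix):

import glob
import os

# My guess at a quick fix: the *.gz files appear to be gzipped JSON, so give
# them the .json.gz suffix that richpath.py insists on.
for path in glob.glob("data/varmisuse/*/graphs*/*.gz"):
    if not path.endswith(".json.gz"):
        os.rename(path, path[: -len(".gz")] + ".json.gz")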

Am I missing something? Maybe it will be easiest if you could share your file structure in the data/varmisuse dir and I'll organize my files accordingly?

Thanks a lot!

mmjb commented 4 years ago

Argh, this is a bit of a mess it seems - I used the data in a storage container which I believed to be the source of the zip that @mallamanis uploaded for the paper (with a bit of moving of data), but there seem to be a bunch of discrepancies, as you've observed. What I used seems to be a mix of the two datasets...

I will try to document this going forward and create a shell script to do the reformatting, but here are some notes from my investigation so far:

Overall, I believe that the following script (run from the directory containing graph-dataset.zip) should do the right thing:

# Where the re-organised dataset should end up, and where this repo lives:
OUTDIR="reorged-varmisuse-dataset"
CODEDIR="$HOME/AZProjects/tf-gnn-samples/"
# Projects that are held out entirely, used only for the "testonly" fold:
TESTONLY_PROJS="commandline humanizer lean"

# Create the raw per-fold directories:
for fold in train valid test testonly; do
    mkdir -p "${OUTDIR}/graphs-${fold}-raw"
done

# Unpack the released dataset (per-project layout):
7za x graph-dataset.zip

# Move the test graphs of the held-out projects into the testonly fold,
# and drop the rest of their data:
for test_proj in $TESTONLY_PROJS; do
    mv graph-dataset/${test_proj}/graphs-test/* "${OUTDIR}/graphs-testonly-raw"
    rm -rf graph-dataset/${test_proj}
done

# Pool the remaining projects' graphs by fold:
for fold in train valid test; do
    mv graph-dataset/*/graphs-${fold}/* "${OUTDIR}/graphs-${fold}-raw"
done

# Give the files the .json.gz suffix that richpath.py expects:
for file in "${OUTDIR}"/*/*.gz; do
    new_file=$(echo "${file}" | sed -e 's/\.gz$/.json.gz/')
    mv "${file}" "${new_file}"
done

# Re-split into smaller JSONLines chunks and remove the raw directories:
for fold in train valid test testonly; do
    python "$CODEDIR/utils/varmisuse_data_splitter.py" "${OUTDIR}/graphs-${fold}-raw/" "${OUTDIR}/graphs-${fold}/"
    rm -rf "${OUTDIR}/graphs-${fold}-raw/"
done

I still need to verify that this yields results that are identical to what I report in the README/paper, but at least the size of the data matches fairly well, so I'm hopeful.
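(For reference, the kind of size check I mean is roughly the following; hypothetical, and assuming the splitter writes one graph per line as gzipped JSONLines with a .json.gz suffix; adjust the glob if your layout differs:)

import glob
import gzip

# Hypothetical sanity check: count datapoints per fold by counting lines in
# the split JSONLines output (one graph per line).
for fold in ["train", "valid", "test", "testonly"]:
    n = 0
    for path in glob.glob(f"reorged-varmisuse-dataset/graphs-{fold}/*.json.gz"):
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            n += sum(1 for _ in f)
    print(fold, n)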

mallamanis commented 4 years ago

Hi @urialon

Overall, we realized that the dataset originally used in the ICLR'18 paper could not be fully released (for licensing reasons) and hence we had to remove parts from the public dataset. The last table in the appendix (page 17) has a rerun on the released dataset. Having said that, as @mmjb says, the GNN-FiLM paper runs everything on the public dataset, so the results from there should be directly comparable.

urialon commented 4 years ago

Thanks a lot guys!

mmjb commented 4 years ago

[I want to keep this open to track progress on ensuring that the script above does the right thing, and then include the shell script in the repo for the next person asking this question...]

urialon commented 4 years ago

Currently the shell script works for me as expected, except that varmisuse_data_splitter.py takes hours, even on a 64-core machine.

Neuromancer42 commented 4 years ago

This script works for me too. The only minor flaw is that it seems to be single-threaded (even though I have 40 cores).

mmjb commented 4 years ago

OK, I did do some preliminary checks and things seem to work out well; this issue should be considered closed by the changes in 9a42024.

FWIW, the reason for doing things single-threaded is that a few of the original data files are several GB unpacked (and due to the poor-in-hindsight choice of storing things as JSON, they need to be loaded entirely into memory); as these also take a long while to process, parallelisation risks triggering OOMs when several of them are handled by different workers at once. What varmisuse_data_splitter.py does is (a) store things in smaller files and (b) use the JSONLines format, which allows streaming one datapoint at a time.
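(To illustrate the difference, here is a minimal sketch of streaming a gzipped JSONLines file one datapoint at a time; the filename is made up and this is not the repo's actual loader:)

import gzip
import json

# Minimal sketch: stream a gzipped JSONLines file one datapoint at a time,
# so the whole file never has to sit in memory (unlike one giant JSON array,
# which must be parsed in full before anything can be read).
def stream_datapoints(path):
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for graph in stream_datapoints("graphs-train/chunk_0000.json.gz"):
    ...  # process one graph, then let it be garbage-collected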