How to train the encoder for our own data? (A Knowledge graph and sample query)

rd27995 commented 3 years ago

Hi,

I have a target graph in the form of a directed networkx graph with 14M nodes and 54M edges. How can I use this target graph, together with a query graph (30 nodes, 33 edges), to train the encoder?

I can only see options for using the built-in datasets in PyTorch Geometric. Is there a simpler way to use my own datasets?

jessxphil commented 3 years ago

I have the same question.

sML-90 commented 3 years ago

+1

qema commented 3 years ago

Thanks for the question and sorry for the late reply. There is not currently a user-facing mechanism to incorporate custom datasets due to the need to define things like train/test split and subgraph sampling -- in general one can create a new DataSource (see common/data.py) to handle new datasets. Note that a pretrained model (such as the one provided in the repo) may be able to handle testing on new datasets, in which case subgraph_matching/alignment.py can load in new graphs to evaluate on.

If the goal is to train on new datasets, as a bit of a hack, one could append an "elif" after this line: https://github.com/snap-stanford/neural-subgraph-learning-GNN/blob/4d074cbc0fa9d81defef746302e62b1b9a97791d/common/data.py#L55

with a spec for a new dataset: elif name == 'newdataset': dataset = [list of networkx or pytorch geometric graphs]

and train using the command line option --dataset=newdataset-balanced and test with --dataset=newdataset-imbalanced.
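As a rough illustration of what the "[list of networkx or pytorch geometric graphs]" placeholder could hold when starting from one very large graph, here is a minimal sketch that cuts it into smaller connected neighborhoods with a breadth-first search. The helper names `sample_neighborhood` and `build_custom_dataset` are hypothetical and not part of the repo, the sample counts and sizes are arbitrary, and treating the directed graph as undirected is an assumption that may not suit every use case.

```python
import random
from collections import deque

import networkx as nx


def sample_neighborhood(graph, node_list, max_nodes, rng):
    """BFS from a random seed node; return the induced subgraph on at most max_nodes nodes."""
    seed = rng.choice(node_list)
    visited = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for nbr in graph.neighbors(node):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
                if len(visited) >= max_nodes:
                    break
    # Copy so each sample is an independent graph with integer node labels.
    sub = graph.subgraph(visited).copy()
    return nx.convert_node_labels_to_integers(sub)


def build_custom_dataset(big_graph, num_samples=500, max_nodes=300, seed=0):
    """Hypothetical helper: turn one large graph into a list of small networkx graphs."""
    rng = random.Random(seed)
    # Assumption: edge direction is dropped. This copies the graph, which
    # costs memory on a graph with millions of nodes.
    undirected = big_graph.to_undirected()
    node_list = list(undirected.nodes)
    return [
        sample_neighborhood(undirected, node_list, max_nodes, rng)
        for _ in range(num_samples)
    ]


# Small random directed graph standing in for the real target graph.
toy_graph = nx.gnp_random_graph(200, 0.05, seed=1, directed=True)
dataset = build_custom_dataset(toy_graph, num_samples=5, max_nodes=20)
print([g.number_of_nodes() for g in dataset])
```

The resulting list is what the new "elif" branch would assign to `dataset`. Whether neighborhoods of this size are appropriate depends on the query graphs you expect at test time.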

rd27995 commented 3 years ago

Thanks @qema, I was able to train the network on my custom datasets; however, I only get around 70% validation accuracy. Do you have any suggestions for improving the model's accuracy or fine-tuning it? I am using all default model parameters. The second plot depicts validation metrics.

[Plot 1: Training_Metrics_500_samples_300_nodes_each]

[Plot 2: Val_Results_100_Epochs_500_Samples_300_nodes]

qema commented 3 years ago

Hi @rd27995, please see the new experimental branch which supports node features and harder negative sampling. For now, the above procedure to add new datasets is still needed. However, one can now train with --dataset=newdataset-basis and test with --dataset=newdataset-imbalanced (-basis being the new data source with harder negative examples). Also, note that testing on the imbalanced dataset (which samples random pairs of graphs) may give a more realistic picture of model performance than validation (which uses an artificial 50-50 label split as well as artificially-generated negative examples).
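To illustrate why a 50-50 validation split can look more favorable than an imbalanced test, the following sketch computes precision for a classifier whose true-positive and false-positive rates are held fixed while the share of positive pairs changes. The rates and positive fractions are made-up numbers for illustration only, not measurements from this repo, and `precision_at_prior` is a hypothetical helper.

```python
def precision_at_prior(tpr, fpr, positive_fraction):
    """Precision of a classifier with fixed TPR/FPR at a given share of positive examples."""
    true_pos = tpr * positive_fraction
    false_pos = fpr * (1.0 - positive_fraction)
    if true_pos + false_pos == 0.0:
        return 0.0  # nothing is predicted positive
    return true_pos / (true_pos + false_pos)


# Illustrative rates only: 70% of positives caught, 30% of negatives misflagged.
tpr, fpr = 0.7, 0.3

balanced = precision_at_prior(tpr, fpr, positive_fraction=0.5)
imbalanced = precision_at_prior(tpr, fpr, positive_fraction=0.05)

print(f"precision at 50% positives: {balanced:.3f}")
print(f"precision at  5% positives: {imbalanced:.3f}")
```

With the same underlying classifier, precision falls as positives become rarer, which is one reason results on randomly sampled pairs can differ noticeably from balanced validation numbers.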

snap-stanford / neural-subgraph-learning-GNN, issue #16