mcanueste / ddgk-torch

Implementation of DDGK using PyTorch
Apache License 2.0

how to use the graph representation #1

Closed Peien429 closed 2 years ago

Peien429 commented 2 years ago

Hi! I have read the paper, but I don't understand how to use the target graph representation. After I get the target graph representation (each dimension is a divergence score between the target graph and a source graph), can I just use it as the input to an SVM classifier? I found this code: `distance.squareform(distance.pdist(scores_np, metric="euclidean"))`. Why not just use `scores_np` as the input to the SVM classifier? Thanks a lot!

mcanueste commented 2 years ago

Hi @Peien429!

Yes, it is exactly as you mentioned. I think you can directly use the embeddings with other methods, i.e. as input to other ML models such as an SVM.
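
For example, a minimal sketch of what I mean (placeholder data and labels, not from this repo; here `scores_np` stands in for the target-by-source divergence matrix):

```python
# Feed each row of scores_np (one target graph's embedding) directly to an SVM.
# scores_np is assumed to have shape (n_target_graphs, n_source_graphs);
# `labels` holds one class label per target graph.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
scores_np = rng.random((30, 8))       # placeholder divergence scores
labels = rng.integers(0, 2, size=30)  # placeholder class labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, scores_np, labels, cv=5).mean())
```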

The embeddings are created for the target graphs based on their 'similarity' to a set of source graphs, with each target-source 'similarity' defining one dimension of the embedded vector. By 'similarity', I mean how well the augmented source encoder predicts the neighboring nodes (in other words, the edges) for each node in the target graph, with the help of the attention and reverse-attention layers.
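
As a rough sketch of that idea (the helper names `train_source_encoder` and `divergence` are hypothetical, not functions from this repo):

```python
# Assemble the embedding matrix: one row per target graph, one column per
# source graph. `train_source_encoder(g)` is assumed to fit an encoder that
# predicts each node's neighbors in the source graph, and `divergence(enc, g)`
# is assumed to return that encoder's loss (with attention / reverse-attention
# layers) when predicting the target graph's edges.
import numpy as np

def embed_targets(target_graphs, source_graphs, train_source_encoder, divergence):
    scores = np.zeros((len(target_graphs), len(source_graphs)))
    encoders = [train_source_encoder(src) for src in source_graphs]
    for i, tgt in enumerate(target_graphs):
        for j, enc in enumerate(encoders):
            # One dimension of the target's embedding per source graph.
            scores[i, j] = divergence(enc, tgt)
    return scores
```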

You could compare this method to localization by triangulation in 3D space. Instead of using 3 reference points, you use a set of source graphs, and instead of using Euclidean distance, you use the Attention -> Encoder -> Reverse Attention model's loss as the distance. Basically, this approach tries to define some sort of metric space for graphs, which is not possible to define otherwise due to isomorphism and node/edge attributes.

However, this way of calculating 'distance' might not formally define a metric space (i.e. properties such as symmetry and the triangle inequality might not hold for this approach), but you could ask the authors or another expert about this, as I am not well versed in the math.
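
If you want to check this empirically, one option (my own sketch, assuming you score every graph against every other graph so the score matrix `D` is square) would be something like:

```python
# Measure how far a square score matrix D is from behaving like a metric:
# the largest symmetry gap |D[i, j] - D[j, i]| and the largest triangle
# inequality violation D[i, j] - (D[i, k] + D[k, j]).
import numpy as np

def metric_violations(D):
    asym = np.max(np.abs(D - D.T))  # 0 would mean perfectly symmetric
    n = D.shape[0]
    tri = max(
        D[i, j] - (D[i, k] + D[k, j])
        for i in range(n) for j in range(n) for k in range(n)
    )  # any value > 0 breaks the triangle inequality
    return asym, tri
```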

This might be the reason why the authors used pairwise Euclidean distances of the embeddings for clustering (mentioned in section 5.2 of their paper, and their implementation is here). Another reason might simply be to have a simpler space to work with instead of the full embedding space, in order to reduce complexity. Regardless, I used the original approach in this implementation.
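
For reference, a small sketch of that distance-matrix route (placeholder data; note that the `metric="precomputed"` parameter name depends on your scikit-learn version, older releases call it `affinity`):

```python
# Pairwise Euclidean distances between embedding rows, then a clustering
# method that accepts a precomputed distance matrix.
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
scores_np = rng.random((30, 8))  # placeholder (n_targets, n_sources) embeddings

dist_matrix = distance.squareform(distance.pdist(scores_np, metric="euclidean"))
clustering = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
)
print(clustering.fit_predict(dist_matrix))
```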

As a separate note, this PyTorch implementation is not thoroughly tested, so if you get unexpected results, you might want to refer to the original implementation as well as the paper. I used this PyTorch implementation, with some changes, in another project on a different problem. Unfortunately, that work is not public, and I don't have enough time to thoroughly test this repository in the near future.