sunfanyunn / InfoGraph

Official code for "InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization" (ICLR 2020, spotlight)
https://openreview.net/forum?id=r1lfF2NYvH

"ZeroDivisionError: float division by zero" when testing with dataset without node labels, like IMDB-BINARY #5

Closed HennyJie closed 3 years ago

HennyJie commented 4 years ago

First of all, thanks for your work!

I am testing unsupervised graph classification. The paper reports experiments on six well-known datasets: MUTAG, PTC, REDDIT-BINARY, REDDIT-5K, IMDB-BINARY and IMDB-MULTI. Some of these datasets have node/edge labels (like MUTAG), while others contain only the three required files A.txt, graph_indicator.txt and graph_labels.txt (like IMDB-BINARY and IMDB-MULTI). For datasets with node labels, I see that the code initializes the node feature h(0) as a one-hot vector of the node label. But when I tested a dataset without node labels, I hit the "float division by zero" error when initializing the Linear() in the first gc_layer in the Encoder class of 'gin.py'.

I debugged the current code and found that this happens because the input 'num_feature' is 0 when no node labels are provided. How do you initialize the node features on datasets without node labels, for example by using the node degree instead? I also looked for additional parameters or other handling of this case, but didn't find any. Could you give some suggestions, or say whether any modification to the current code is needed when testing on datasets without node labels? Thanks!
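
A likely mechanism for the error can be sketched as follows, under the assumption that the linear layer's default initializer scales its bounds by 1/sqrt(fan_in), where fan_in is the number of input features (the exact code path depends on the installed PyTorch version). The function name `uniform_init_bound` is hypothetical and only mimics that scaling step.

```python
import math


def uniform_init_bound(fan_in):
    """Bound of a uniform weight init that scales with 1 / sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


# With real input features the bound is well defined.
bound_ok = uniform_init_bound(32)

# With num_features == 0 (no node labels), the same step divides by zero.
try:
    uniform_init_bound(0)
    failed = False
except ZeroDivisionError:
    failed = True
```

If this is indeed the cause, any fix has to guarantee that the first layer receives at least one input feature.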

joshr17 commented 3 years ago

I am also having the exact same issue and would be very interested to hear of any thoughts you might have on solving this problem!

Best

Josh

HennyJie commented 3 years ago

Hi joshr,

Yes, I can provide one possible fix here. For datasets without node labels, I used the one-hot encoding of each node's out-degree as the node feature. Simply add the code below right after the dataset is loaded:

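# Datasets such as IMDB-BINARY come without node features, so dataset.data.x is None.
if dataset.data.x is None:
    # First pass: collect each graph's out-degrees and track the largest one.
    max_degree = 0
    degs = []
    for data in dataset:
        degs += [degree(data.edge_index[0], dtype=torch.long)]
        max_degree = max(max_degree, degs[-1].max().item())

    if max_degree < 1000:
        # Small degree range: one-hot encode the degree (max_degree + 1 features).
        dataset.transform = T.OneHotDegree(max_degree)
    else:
        # Large degree range: pass the dataset-wide degree mean and std to a custom transform.
        # NormalizedDegree is not defined in this thread and must be supplied separately.
        deg = torch.cat(degs, dim=0).to(torch.float)
        mean, std = deg.mean().item(), deg.std().item()
        dataset.transform = NormalizedDegree(mean, std)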

I tried it on datasets without node labels such as IMDB-BINARY, and it runs after this modification, though I am still not sure how the authors handle this issue and have received no reply yet.
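
To make the two branches of the snippet above concrete, here is a dependency-free sketch of the features they would be expected to produce on a toy graph. The helper names (`out_degrees`, `one_hot_degree`, `normalized_degree`) are hypothetical and written only for illustration. Since `NormalizedDegree` is not defined in this thread, the z-scoring shown here is an assumption about what such a transform does, not its actual implementation.

```python
from statistics import mean, stdev


def out_degrees(edge_index, num_nodes):
    """Count how often each node appears as a source in edge_index[0]."""
    degs = [0] * num_nodes
    for src in edge_index[0]:
        degs[src] += 1
    return degs


def one_hot_degree(degs, max_degree):
    """One row per node, with a single 1.0 in the column given by its degree."""
    return [[1.0 if d == k else 0.0 for k in range(max_degree + 1)] for d in degs]


def normalized_degree(degs, mu, sigma):
    """Assumed behaviour: one scalar feature per node, the z-scored degree."""
    return [[(d - mu) / sigma] for d in degs]


# Toy undirected path graph 0 - 1 - 2, each edge stored in both directions.
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
degs = out_degrees(edge_index, num_nodes=3)

# Branch 1: one-hot features of width max_degree + 1.
x_one_hot = one_hot_degree(degs, max_degree=max(degs))

# Branch 2: a single standardized feature (sample std, as torch's default std uses).
x_scaled = normalized_degree(degs, mean(degs), stdev(degs))
```

In the one-hot case the encoder's input width becomes max_degree + 1, which is why the snippet switches to a single scalar feature once the maximum degree gets large.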

HennyJie commented 3 years ago

I really hope the authors can share their solution. :)

joshr17 commented 3 years ago

Thanks so much, that works perfectly. Indeed it would be interesting to know what they did.

In case anyone else comes across this thread, you will need the following imports in addition to the snippet HennyJie provided :)

from torch_geometric.utils import degree
import torch_geometric.transforms as T

sunfanyunn commented 3 years ago

I believe what I did was simply set dataset_num_features to 1 when there are no node features, and the code runs just fine. I just updated my repo. Thanks!
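
A minimal sketch of that kind of fallback, for readers comparing it with the degree-based approach. The function names are hypothetical, and feeding every node a constant value of 1.0 is an assumption; the thread does not say which input values are used once the width is set to 1.

```python
def resolve_num_features(num_features):
    """Return the encoder input width, falling back to 1 when there are no node features."""
    return num_features if num_features > 0 else 1


def placeholder_features(num_nodes):
    """Assumed stand-in input: one constant feature per node."""
    return [[1.0] for _ in range(num_nodes)]


# A dataset without node labels reports 0 features; the fallback turns that into 1.
dataset_num_features = resolve_num_features(0)
x = placeholder_features(num_nodes=4)
```

With a constant input, nodes can only be told apart through the graph structure seen during neighborhood aggregation, whereas the degree-based features encode part of that structure directly in the input.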

ha-lins commented 3 years ago

Update: Please remove this line, and then it works fine.

===============
It seems that the new commit doesn't work (in the unsupervised setting): the process always stops here, without any error message, and doesn't move on when I run main.py with any dataset:

================
lr: 0.001
num_features: 1
hidden_dim: 32
num_gc_layers: 3
================
1

Could you please check the commit again? Thanks a lot!