tkipf / keras-gcn

Keras implementation of Graph Convolutional Networks

Differences on the Cora dataset #24

Open · hechtlinger opened this issue 6 years ago

hechtlinger commented 6 years ago

The labels here in keras-gcn do not seem to correspond to the labels in the gcn repository when you load the data. The indices are the same, but the values are not. Also, if you sum y_train here, there aren't 20 labels per class.
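
A quick way to check the per-class counts (a rough sketch, not code from the issue; it assumes the load_data and get_splits helpers in kegra/utils.py, whose exact return values may differ):

import numpy as np
from kegra.utils import load_data, get_splits  # assumed helpers from this repository

X, A, y = load_data()
y_train, y_val, y_test, idx_train, idx_val, idx_test, train_mask = get_splits(y)

# Per-class label counts in the training split; with the Planetoid splits
# used in the paper this would be 20 for every class.
print(np.sum(y_train, axis=0))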

Are the two datasets actually different, as it seems? What is the reason for that, and which one should we use to replicate the paper's results?

tkipf commented 6 years ago

Thanks for commenting and apologies for the confusion. Only the ‘gcn’ repository was intended to reproduce the results of the paper and hosts the dataset splits that we used (which were introduced in the Planetoid paper).

This repository (keras-gcn) is not intended to replicate the results of the paper (some subtleties in Keras did not allow me to re-implement the model with exactly the same details as in the paper). Also, the dataset loader here does not load the splits from the Planetoid paper but instead the original version of the Cora dataset; splits are then generated at random.

I will update the description to make this (important) point a bit clearer. Thanks for catching this!

haczqyf commented 6 years ago

Hi Thomas,

Thanks for providing such a well-written implementation of GCN. I have been working on studying GCN for a while. I am new to this area and have learned a lot from your paper and code.

Regarding the dataset loader in keras-gcn, I would also like to add a few remarks about points that confused me and that I figured out after some experiments, in case others have the same questions.

I was wondering whether the training, validation, and test sets are always the same subsets of the whole Cora dataset, i.e., whether the training set differs when we run the code at different times. You mentioned that the splits here are generated at random. In fact, the training set always consists of the first 140 samples of Cora, which means it is fixed across runs. I therefore think the phrase 'splits are then generated at random' could be made clearer to avoid ambiguity. If I have misunderstood this, please correct me.

Another remark concerns a detail in the function 'encode_onehot' in utils.py. I find that iterating over the set of classes (via 'enumerate') may assign different one-hot vectors to the classes when we run the code at different times. To make this clearer, I have provided an example below.

# First time
classes = {'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}

for i, c in enumerate(classes):
    print(i)
    print(c)

# Output
0
Genetic_Algorithms
1
Theory
2
Probabilistic_Methods
3
Reinforcement_Learning
4
Case_Based
5
Neural_Networks
6
Rule_Learning

# Second time
classes = {'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}
for i, c in enumerate(classes):
    print(i)
    print(c)

# Output
0
Probabilistic_Methods
1
Rule_Learning
2
Neural_Networks
3
Genetic_Algorithms
4
Theory
5
Case_Based
6
Reinforcement_Learning
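
As far as I understand, the reason is that a Python set has no defined iteration order and string hashes are randomized per interpreter process, so the same set of class names can be enumerated in a different order on every run. A minimal, self-contained way to see this (not code from the repository; it simply spawns fresh interpreters):

import subprocess
import sys

# Each fresh interpreter gets its own string-hash seed (assuming PYTHONHASHSEED
# is unset, which is the default), so a set of class names can iterate in a
# different order on every run.
snippet = "print(list({'Case_Based', 'Theory', 'Neural_Networks', 'Rule_Learning'}))"
for _ in range(3):
    subprocess.run([sys.executable, "-c", snippet])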

This can be inconvenient if, for example, we want to compute the frequency distribution of the classes in the training set. One potential fix for pinning down the one-hot vectors of the classes is to slightly modify one line in the 'encode_onehot' function:

import numpy as np

def encode_onehot(labels):
    # classes = set(labels)  # original: set iteration order can change between runs
    classes = sorted(list(set(labels)))  # sorting fixes the class -> column mapping
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    labels_onehot = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)
    return labels_onehot
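
For instance, a quick check (this assumes the modified encode_onehot above, with numpy imported; the labels are made up for illustration):

labels = ['Theory', 'Neural_Networks', 'Theory', 'Case_Based']
print(encode_onehot(labels))
# With sorted(), 'Case_Based' always maps to column 0, 'Neural_Networks'
# to column 1, and 'Theory' to column 2, on every run.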

Finally, I am also thinking about how we could split the dataset into training, validation, and test sets at random, e.g., how to draw 140 random samples from the whole dataset as the training set each time we run the code. The current 'get_splits' function splits the dataset as shown below; the training set always takes the first 140 rows.

idx_train = range(140)
idx_val = range(200, 500)
idx_test = range(500, 1500)

I feel that the frequency distribution of the classes in the training set might affect prediction accuracy, and it would make more sense for the training, validation, and test sets to be drawn at random from the whole dataset if we run the code several times and average the accuracy; a sketch of one way to do this follows below.
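
For example, a random split with the same sizes as the current get_splits (140 training, 300 validation, 1000 test samples) could be drawn from a permutation of the node indices. This is only a sketch, not code from the repository, and it assumes the full Cora dataset with 2708 nodes:

import numpy as np

n_nodes = 2708  # number of papers in Cora
perm = np.random.permutation(n_nodes)

# Same split sizes as the current get_splits, but drawn at random on each run.
idx_train = perm[:140]
idx_val = perm[140:440]
idx_test = perm[440:1440]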

Best, Yifan

tkipf commented 6 years ago

Hi Yifan, thanks for looking into this. I agree with all of your points. As mentioned previously, this data loader is only meant as a 'quick and dirty' example to show how data can be loaded into the model. For the reproducible dataset splits used in our paper, please have a look at https://github.com/tkipf/gcn.

I have updated the project readme with a big warning to hopefully avoid confusion about this in the future.