Preprocessing adjacency matrix

leowyy commented 5 years ago

Hi Thomas,

I'm slightly puzzled by a portion of the code used for data preprocessing. In utils.load_data(), I noticed that the indices of features and labels are reordered according to the pre-saved test indices. However, the indices of the adjacency matrix are not reordered. Is this because the indices of the adjacency matrix were already reordered in your saved files? See lines 69-73:

    features[test_idx_reorder, :] = features[test_idx_range, :]
    adj = nx.adjacency_matrix(nx.from_dict_of_lists(graph))

    ...
    labels[test_idx_reorder, :] = labels[test_idx_range, :]

Thank you!
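To make the question concrete, here is a minimal sketch of what the reordering assignment does to row order. The toy array and the index values are hypothetical and only stand in for the real `features`, `test_idx_reorder`, and `test_idx_range`; the sketch assumes, as the variable names suggest, that `test_idx_range` is the sorted version of `test_idx_reorder`.

```python
import numpy as np

# Hypothetical toy data: 6 rows, where row i initially holds the value 10 * i.
features = (np.arange(6) * 10).reshape(6, 1)

# Hypothetical test indices in file order, and the same indices sorted.
test_idx_reorder = np.array([5, 3, 4])
test_idx_range = np.sort(test_idx_reorder)  # [3, 4, 5]

# The right-hand side is read first (as a copy), then written to the
# permuted positions: row 5 <- old row 3, row 3 <- old row 4, row 4 <- old row 5.
features[test_idx_reorder, :] = features[test_idx_range, :]

print(features.ravel().tolist())  # [0, 10, 20, 40, 50, 30]
```

Only the rows named in the index arrays move; the adjacency matrix built on the next line is not touched by this assignment.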

tkipf commented 5 years ago

Good catch. This might indeed be a bug, but given that this data loader has been used as a de facto benchmark now in quite a number of works (and also by predecessor work by Yang et al., ICML 2016), it is best to just keep it as is if you’d like to compare numbers.

If you’re interested in a more instructive/flexible data loader for these citation network datasets, have a look at https://github.com/tkipf/pygcn. Note that the dataset splits are different there, but this allows you to choose your own splits and run your own evaluation protocols more easily (e.g. test on different random splits). In general I’d recommend against using the fixed split from Yang et al.’s paper (which we also use in this repo) these days, as it is very easy to overfit on the validation set, which affects many recent works.
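As an illustration of the random-split idea, a minimal sketch follows. The function name `random_split`, the node count, and the split sizes are all hypothetical; this is not the splitting code from either repository.

```python
import numpy as np

def random_split(num_nodes, num_train, num_val, seed):
    """Return disjoint train/val/test index arrays from one random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    idx_train = perm[:num_train]
    idx_val = perm[num_train:num_train + num_val]
    idx_test = perm[num_train + num_val:]  # everything left over
    return idx_train, idx_val, idx_test

# Hypothetical sizes; a different seed gives a different split.
idx_train, idx_val, idx_test = random_split(num_nodes=100, num_train=20, num_val=30, seed=0)
```

Repeating the evaluation over several seeds and reporting the spread is one way to avoid tuning to a single fixed validation set.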


leowyy commented 5 years ago

Thanks for your response; I'll take a look at the data loader in the repo you've mentioned.

Merry Christmas!

DavideBuffelli commented 5 years ago

I know this issue is closed, but I think it would be very useful to present a fix for this bug, especially for future research. I think this should work:

    ...

    features[test_idx_reorder, :] = features[test_idx_range, :]
    G = nx.from_dict_of_lists(graph)
    G = nx.relabel_nodes(G, {i: j for i, j in zip(test_idx_reorder, test_idx_range)})
    adj = nx.adjacency_matrix(G)

    ...
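One detail worth checking with any relabeling approach: when no `nodelist` is given, the rows of the adjacency matrix follow the graph's internal node iteration order, not the numeric value of the node labels, so relabeling alone may not move rows the way one expects. The sketch below uses a hypothetical 4-node path graph and a hypothetical swap of two indices, and pins the row order explicitly. It uses `nx.to_numpy_array` for brevity; `nx.adjacency_matrix` accepts the same `nodelist` argument. Whether a relabeling is needed at all, and in which direction the mapping should go, is not settled in this thread.

```python
import networkx as nx

# Hypothetical toy graph (path 0-1-2-3) standing in for `graph`.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Hypothetical indices standing in for test_idx_reorder / test_idx_range.
test_idx_reorder = [3, 2]
test_idx_range = sorted(test_idx_reorder)  # [2, 3]

G = nx.from_dict_of_lists(graph)
mapping = dict(zip(test_idx_reorder, test_idx_range))  # {3: 2, 2: 3}
H = nx.relabel_nodes(G, mapping)  # returns a relabeled copy

# Passing nodelist guarantees that row/column i belongs to the node labelled i.
nodelist = list(range(G.number_of_nodes()))
adj_before = nx.to_numpy_array(G, nodelist=nodelist)
adj_after = nx.to_numpy_array(H, nodelist=nodelist)

print(adj_before[1, 2], adj_after[1, 2])  # 1.0 0.0
print(adj_before[1, 3], adj_after[1, 3])  # 0.0 1.0
```

With the order pinned, node 1's neighbour moves from index 2 to index 3, which is the effect the relabeling is meant to have on the matrix.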

tkipf / gcn

Preprocessing adjacency matrix #76