predict-idlab / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License
246 stars 51 forks

Reproduce the results from the RDF2Vec paper #1

Closed ritatsousa closed 4 years ago

ritatsousa commented 4 years ago

Hello, I am trying to reproduce the results from the RDF2Vec paper using your implementation. More specifically, I am using the generated embeddings for the AIFB dataset for classification with SVM, C4.5, Naive Bayes, and KNN. However, the results I obtain differ considerably from those reported in the RDF2Vec paper. Have you tried to reproduce the results? Could these differences be due to differences in implementation? Thanks

GillesVandewiele commented 4 years ago

Hi,

Thanks for showing interest in this repository! What exactly are the results you are achieving, and which parameters are you using? The depth of the walks, the dimension of the embedding, and the number of walks extracted per entity can have quite an impact. I haven't looked that closely into exactly reproducing the numbers, but I think I did get quite good performance on both the AIFB and BGS datasets using this implementation.

Moreover, are you using the scikit-learn implementations of the ML classifiers? These can have different hyper-parameter defaults than the WEKA ones that were used in the original paper.
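For what it's worth, the defaults differ quite a bit. A minimal sketch below; the WEKA defaults are quoted from memory, so treat them as assumptions: WEKA's SMO uses a linear polynomial kernel with C=1, while sklearn's SVC defaults to an RBF kernel, and WEKA's IBk defaults to k=1 versus sklearn's k=5:

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Approximate WEKA's SMO defaults: linear kernel with C=1
# (sklearn's SVC would otherwise default to an RBF kernel)
svm = SVC(kernel='linear', C=1.0)

# Approximate WEKA's IBk default of k=1 (KNeighborsClassifier defaults to k=5)
knn = KNeighborsClassifier(n_neighbors=1)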

GillesVandewiele commented 4 years ago

Also, it should perhaps be noted that there are small differences between the results reported in the original paper and those reported by the authors of the relational graph CNN (R-GCN) paper:

https://arxiv.org/pdf/1703.06103.pdf and http://www.semantic-web-journal.net/system/files/swj1495.pdf

GillesVandewiele commented 4 years ago

Ok, so I tried reproducing the results. I did notice a bug: when max_walks_per_graph is set to float('inf'), which should be equivalent to extracting all possible walks, the extraction gets stuck in an infinite loop somewhere. I will have a look at that later.
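If I had to guess (an assumption, not a confirmed diagnosis), the sampler rejects duplicate walks inside a while-loop, which can never terminate once max_walks_per_graph exceeds the number of distinct walks. An exhaustive depth-bounded enumeration would terminate by construction; a rough sketch, where get_neighbors is a hypothetical accessor returning (predicate, object) hops:

def extract_all_walks(kg, root, depth):
    # Breadth-first enumeration of every walk of at most `depth` hops;
    # no rejection sampling, so it always terminates.
    walks = [[root]]
    for _ in range(depth):
        extended = []
        for walk in walks:
            hops = kg.get_neighbors(walk[-1])  # hypothetical accessor
            if not hops:
                extended.append(walk)  # dead end: keep the shorter walk
            for pred, obj in hops:
                extended.append(walk + [pred, obj])
        walks = extended
    return walks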

I extracted 5000 walks per graph, used no Weisfeiler-Lehman kernel, and left all remaining hyper-parameters at their defaults, which gave the following results:

Random Forest:
0.8055555555555556
[[13  1  1  0]
 [ 0  4  2  0]
 [ 2  0 10  0]
 [ 1  0  0  2]]
Support Vector Machine:
0.8333333333333334
[[13  1  1  0]
 [ 0  4  1  1]
 [ 1  0 11  0]
 [ 1  0  0  2]]
Gaussian Naive Bayes:
0.8611111111111112
[[12  1  2  0]
 [ 0  5  1  0]
 [ 1  0 11  0]
 [ 0  0  0  3]]
CART Decision Tree:
0.6666666666666666
[[12  1  1  1]
 [ 0  3  2  1]
 [ 1  2  8  1]
 [ 0  1  1  1]]
K-Nearest Neighbors:
0.8055555555555556
[[13  0  1  1]
 [ 0  5  1  0]
 [ 3  0  8  1]
 [ 0  0  0  3]]

Most classifiers seem to perform better, except for the SVM, which doesn't reach 90% at all, and the decision tree, which fails miserably (though note that scikit-learn implements CART without post-pruning, a different algorithm than C4.5).
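That said, scikit-learn does support minimal cost-complexity post-pruning via the ccp_alpha parameter (in recent versions), which gets closer to C4.5's pruned trees. A minimal sketch, with an arbitrary alpha grid, using the embeddings and labels from the script below:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha prunes more aggressively; 0.0 keeps the unpruned tree.
pruned_tree = GridSearchCV(DecisionTreeClassifier(),
                           {'ccp_alpha': [0.0, 0.001, 0.01, 0.05]})
pruned_tree.fit(train_embeddings, train_labels)
print(pruned_tree.best_params_)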

EDIT: Here's the code:

import rdflib
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

from graph import *
from rdf2vec import RDF2VecTransformer

import warnings
warnings.filterwarnings('ignore')

print(end='Loading data... ', flush=True)
g = rdflib.Graph()
g.parse('../data/aifb.n3', format='n3')
print('OK')

test_data = pd.read_csv('../data/AIFB_test.tsv', sep='\t')
train_data = pd.read_csv('../data/AIFB_train.tsv', sep='\t')

train_people = [rdflib.URIRef(x) for x in train_data['person']]
train_labels = train_data['label_affiliation']

test_people = [rdflib.URIRef(x) for x in test_data['person']]
test_labels = test_data['label_affiliation']

# Predicates that (directly or inversely) encode the target affiliation;
# they are removed from the graph so the embeddings cannot leak the label.
label_predicates = [
    rdflib.URIRef('http://swrc.ontoware.org/ontology#affiliation'),
    rdflib.URIRef('http://swrc.ontoware.org/ontology#employs'),
    rdflib.URIRef('http://swrc.ontoware.org/ontology#carriedOutBy')
]

# Extract the train and test graphs

kg = rdflib_to_kg(g, label_predicates=label_predicates)

train_graphs = [extract_instance(kg, person) for person in train_people]
test_graphs = [extract_instance(kg, person) for person in test_people]

# Learn embeddings for train and test entities in a single pass (transductive);
# 5000 random walks per entity, no Weisfeiler-Lehman relabelling.
transformer = RDF2VecTransformer(_type='walk', walks_per_graph=5000)
embeddings = transformer.fit_transform(train_graphs + test_graphs)

train_embeddings = embeddings[:len(train_graphs)]
test_embeddings = embeddings[len(train_graphs):]

rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_embeddings, train_labels)

print('Random Forest:')
preds = rf.predict(test_embeddings)
print(accuracy_score(test_labels, preds))
print(confusion_matrix(test_labels, preds))

# Grid-search the SVM's regularisation strength C over 10**-3 .. 10**3
clf = GridSearchCV(SVC(), {'C': [10**i for i in range(-3, 4)]})
clf.fit(train_embeddings, train_labels)

print('Support Vector Machine:')
preds = clf.predict(test_embeddings)
print(accuracy_score(test_labels, preds))
print(confusion_matrix(test_labels, preds))

clf = GaussianNB()
clf.fit(train_embeddings, train_labels)

print('Gaussian Naive Bayes:')
preds = clf.predict(test_embeddings)
print(accuracy_score(test_labels, preds))
print(confusion_matrix(test_labels, preds))

# Note: sklearn grows an unpruned CART tree, not a C4.5 tree
clf = DecisionTreeClassifier()
clf.fit(train_embeddings, train_labels)

print('CART Decision Tree:')
preds = clf.predict(test_embeddings)
print(accuracy_score(test_labels, preds))
print(confusion_matrix(test_labels, preds))

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train_embeddings, train_labels)

print('K-Nearest Neighbors:')
preds = clf.predict(test_embeddings)
print(accuracy_score(test_labels, preds))
print(confusion_matrix(test_labels, preds))
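One last caveat: walk sampling, Word2Vec training, and several of the classifiers are stochastic, so these numbers will wobble between runs. A sketch of pinning down the sklearn side (fully fixing the Word2Vec side additionally requires a fixed seed and single-threaded training in gensim, as far as I know):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

np.random.seed(42)  # seed NumPy for anything that samples from it

# Seed the estimators whose fit() involves randomness
rf = RandomForestClassifier(n_estimators=100, random_state=42)
tree = DecisionTreeClassifier(random_state=42)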