A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs

Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. This study surveys 23 recent embedding-based entity alignment approaches and categorizes them based on their techniques and characteristics. We further observe that current approaches use different datasets in evaluation, and the degree distributions of entities in these datasets are inconsistent with real KGs. Hence, we propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. This study also produces an open-source library, which includes 12 representative embedding-based entity alignment approaches. We extensively evaluate these approaches on the generated datasets, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.

Key contributors ✨

Zequn Sun (NJU)

Wei Hu (NJU)

Muhao Chen (UC Davis)

Haofen Wang (TONGJI)

UPDATE

Aug. 1, 2021: We release the source code for entity alignment with dangling cases.
June 29, 2021: We release the DBP2.0 dataset for entity alignment with dangling cases.
Jan. 8, 2021: The results of AliNet on OpenEA datasets are available at Google docs.
Nov. 30, 2020: We release a new version (v2.0) of the OpenEA dataset, where the URIs of DBpedia and YAGO entities are encoded to resolve the name bias issue. It is strongly recommended to use the v2.0 dataset for evaluating attribute-based entity alignment methods, so that the results better reflect the robustness of these methods in real-world situations.
Sep. 24, 2020: We add AliNet.

Library for Embedding-based Entity Alignment
1. Overview
2. Getting Started
KG Sampling Method and Datasets
Experiment and Results
1. Experiment Settings
2. Detailed Results
License
Citation

Library for Embedding-based Entity Alignment

Overview

We use Python and TensorFlow to develop an open-source library, namely OpenEA, for embedding-based entity alignment. The software architecture is illustrated in the following figure.

The design goals and features of OpenEA include three aspects, i.e., loose coupling, functionality and extensibility, and off-the-shelf solutions.

Loose coupling. The implementations of the embedding and alignment modules are independent of each other. OpenEA provides a framework template with pre-defined input and output data structures that combines the three modules into an integral pipeline. Users can freely call and combine different techniques in these modules.
Functionality and extensibility. OpenEA implements a set of necessary functions as its underlying components, including initialization functions, loss functions and negative sampling methods in the embedding module; combination and learning strategies in the interaction mode; as well as distance metrics and alignment inference strategies in the alignment module. On top of those, OpenEA also provides a set of flexible and high-level functions with configuration options to call the underlying components. In this way, new functions can be easily integrated by adding new configuration options.
Off-the-shelf solutions. To facilitate the use of OpenEA in diverse scenarios, we try our best to integrate or re-build a majority of existing embedding-based entity alignment approaches. Currently, OpenEA has integrated the following embedding-based entity alignment approaches:
1. MTransE: Multilingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment. IJCAI 2017.
2. IPTransE: Iterative Entity Alignment via Joint Knowledge Embeddings. IJCAI 2017.
3. JAPE: Cross-Lingual Entity Alignment via Joint Attribute-Preserving Embedding. ISWC 2017.
4. KDCoE: Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment. IJCAI 2018.
5. BootEA: Bootstrapping Entity Alignment with Knowledge Graph Embedding. IJCAI 2018.
6. GCN-Align: Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks. EMNLP 2018.
7. AttrE: Entity Alignment between Knowledge Graphs Using Attribute Embeddings. AAAI 2019.
8. IMUSE: Unsupervised Entity Alignment Using Attribute Triples and Relation Triples. DASFAA 2019.
9. SEA: Semi-Supervised Entity Alignment via Knowledge Graph Embedding with Awareness of Degree Difference. WWW 2019.
10. RSN4EA: Learning to Exploit Long-term Relational Dependencies in Knowledge Graphs. ICML 2019.
11. MultiKE: Multi-view Knowledge Graph Embedding for Entity Alignment. IJCAI 2019.
12. RDGCN: Relation-Aware Entity Alignment for Heterogeneous Knowledge Graphs. IJCAI 2019.
13. AliNet: Knowledge Graph Alignment Network with Gated Multi-hop Neighborhood Aggregation. AAAI 2020.
OpenEA has also integrated the following relationship embedding models and two attribute embedding models (AC2Vec and Label2vec) in the embedding module:
1. TransH: Knowledge Graph Embedding by Translating on Hyperplanes. AAAI 2014.
2. TransR: Learning Entity and Relation Embeddings for Knowledge Graph Completion. AAAI 2015.
3. TransD: Knowledge Graph Embedding via Dynamic Mapping Matrix. ACL 2015.
4. HolE: Holographic Embeddings of Knowledge Graphs. AAAI 2016.
5. ProjE: ProjE: Embedding Projection for Knowledge Graph Completion. AAAI 2017.
6. ConvE: Convolutional 2D Knowledge Graph Embeddings. AAAI 2018.
7. SimplE: SimplE Embedding for Link Prediction in Knowledge Graphs. NeurIPS 2018.
8. RotatE: RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. ICLR 2019.
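
To make the role of the alignment module more concrete, here is a minimal sketch of one possible distance metric (cosine similarity) combined with a greedy nearest-neighbour inference strategy. It is plain Python and does not use OpenEA's own API; the function names `cosine` and `greedy_align` and the toy embeddings are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_align(emb1, emb2):
    """Map each KG1 entity to its most similar KG2 entity.

    emb1, emb2: dicts from entity name to embedding (list of floats).
    """
    alignment = {}
    for e1, v1 in emb1.items():
        alignment[e1] = max(emb2, key=lambda e2: cosine(v1, emb2[e2]))
    return alignment


# Toy embeddings (hypothetical values, for illustration only).
kg1 = {"en:Paris": [0.9, 0.1], "en:Berlin": [0.1, 0.9]}
kg2 = {"fr:Paris": [1.0, 0.2], "fr:Berlin": [0.2, 1.0]}
predicted = greedy_align(kg1, kg2)
```

Greedy search lets several KG1 entities pick the same KG2 entity; other inference strategies trade this simplicity for one-to-one constraints.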

Getting Started

These instructions cover how to get a copy of the library and how to install and run it on your local machine for development and testing purposes. They also provide an overview of the package structure of the source code.

Package Description

src/
├── openea/
│   ├── approaches/: package of the implementations for existing embedding-based entity alignment approaches
│   ├── models/: package of the implementations for unexplored relationship embedding models
│   ├── modules/: package of the implementations for the framework of embedding module, alignment module, and their interaction
│   ├── expriment/: package of the implementations for evaluation methods

Dependencies

Python 3.x (tested on Python 3.6)
TensorFlow 1.x (tested on TensorFlow 1.8 and 1.12)
Scipy
Numpy
Graph-tool or igraph or NetworkX
Pandas
Scikit-learn
Matching==0.1.1
Gensim

Installation

We recommend creating a new conda environment to install and run OpenEA. You should first install tensorflow-gpu (tested on 1.8 and 1.12), graph-tool (tested on 2.27 and 2.29; the latest version would cause a bug), and python-igraph using conda:

conda create --name openea python=3.6 graph-tool==2.40 -c conda-forge
conda activate openea
conda install tensorflow-gpu==1.12
conda install -c conda-forge python-igraph

Then, OpenEA can be installed using pip with the following steps:

git clone https://github.com/nju-websoft/OpenEA.git OpenEA
cd OpenEA
pip install -e .

Usage

The following example shows how to use OpenEA in Python. (We assume that you have already downloaded our datasets and configured the hyperparameters as in the examples.)

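import openea as oa

# Choose the embedding model to train.
model = oa.kge_model.TransE
# Load the hyperparameters and the two KGs to be aligned.
args = load_args("hyperparameter file folder")
kgs = read_kgs_from_folder("data folder")
model.set_args(args)
model.set_kgs(kgs)
# Initialize the model, train it, test it, and save the output.
model.init()
model.run()
model.test()
model.save()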

More examples are available here

To run the off-the-shelf approaches on our datasets and reproduce our experiments, change into the ./run/ directory and use the following script:

python main_from_args.py "predefined_arguments" "dataset_name" "split"

For example, if you want to run BootEA on D-W-15K (V1) using the first split, please execute the following script:

python main_from_args.py ./args/bootea_args_15K.json D_W_15K_V1 721_5fold/1/
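
To repeat the same run over all five folds, the command line can be assembled programmatically. The sketch below only builds the argument lists; the helper name `fold_commands` is hypothetical, while the script name, argument file, and dataset name are taken from the example above.

```python
def fold_commands(args_file, dataset, folds=range(1, 6)):
    """Build one main_from_args.py command (as an argument list) per fold."""
    return [
        ["python", "main_from_args.py", args_file, dataset, "721_5fold/%d/" % k]
        for k in folds
    ]


commands = fold_commands("./args/bootea_args_15K.json", "D_W_15K_V1")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from within the ./run/ directory.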

KG Sampling Method and Datasets

Because the currently widely used datasets differ considerably from real-world KGs, we present a new dataset sampling algorithm to generate benchmark datasets for embedding-based entity alignment.

Iterative Degree-based Sampling

The proposed iterative degree-based sampling (IDS) algorithm simultaneously deletes entities in two source KGs with reference alignment until the desired size is reached, while keeping the degree distribution of the sampled dataset similar to that of the source KG. The following figure describes the sampling procedure.
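
One way to check how well a sample preserves the degree distribution is to compare the two distributions with the Jensen-Shannon divergence. The sketch below is an illustration under that assumption and is not the IDS implementation itself; the function names and the toy triples are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(triples):
    """Return {degree: fraction of entities with that degree}."""
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    counts = Counter(degree.values())
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(a, m):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy source KG and a sample obtained by deleting entity "d".
source = [("a", "r", "b"), ("a", "r", "c"), ("b", "r", "c"), ("c", "r", "d")]
sample = [("a", "r", "b"), ("a", "r", "c"), ("b", "r", "c")]
gap = js_divergence(degree_distribution(source), degree_distribution(sample))
```

A value of 0 means the two degree distributions are identical; with base-2 logarithms the value never exceeds 1.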

Dataset Overview

We choose three well-known KGs as our sources: DBpedia (2016-10), Wikidata (20160801) and YAGO3. Also, we consider two cross-lingual versions of DBpedia: English-French and English-German. We follow the conventions in JAPE and BootEA to generate datasets of two sizes with 15K and 100K entities, using the IDS algorithm:

# Entities	Languages	Dataset names
15K	Cross-lingual	EN-FR-15K, EN-DE-15K
15K	English	D-W-15K, D-Y-15K
100K	Cross-lingual	EN-FR-100K, EN-DE-100K
100K	English	D-W-100K, D-Y-100K

The v1.1 datasets used in this paper can be downloaded from figshare, Dropbox or Baidu Wangpan (password: 9feb). (Note that we have fixed a minor format issue in YAGO of our v1.0 datasets. Please download our v1.1 datasets from the above links and use this version for evaluation.)

(Recommended) The v2.0 datasets can be downloaded from figshare, Dropbox or Baidu Wangpan (password: nub1).

Dataset Statistics

We generate two versions of datasets for each pair of KGs to be aligned. V1 is generated by directly using the IDS algorithm. For V2, we first randomly delete entities with low degrees (d <= 5) in the source KG so that the average degree doubles, and then execute IDS to fit the new KG. The statistics of the datasets are shown below.

Dataset Description

We hereby take the EN_FR_15K_V1 dataset as an example to introduce the files in each dataset. In the 721_5fold folder, we divide the reference entity alignment into five disjoint folds, each of which accounts for 20% of the total alignment. For each fold, we pick this fold (20%) as training data and leave the remaining (80%) for validation (10%) and testing (70%). The directory structure of each dataset is listed as follows:

EN_FR_15K_V1/
├── attr_triples_1: attribute triples in KG1
├── attr_triples_2: attribute triples in KG2
├── rel_triples_1: relation triples in KG1
├── rel_triples_2: relation triples in KG2
├── ent_links: entity alignment between KG1 and KG2
├── 721_5fold/: entity alignment with test/train/valid (7:2:1) splits
│   ├── 1/: the first fold
│   │   ├── test_links
│   │   ├── train_links
│   │   └── valid_links
│   ├── 2/
│   ├── 3/
│   ├── 4/
│   ├── 5/

Experiment and Results

Experiment Settings

The common hyper-parameters used for OpenEA are shown below.

	15K	100K
Batch size for rel. triples	5,000	20,000
Termination condition	Early stop when the Hits@1 score begins to drop on the validation sets, checked every 10 epochs.
Max. epochs	2,000
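
The termination condition can be sketched as a small training loop. This is an illustration rather than OpenEA's actual code: `train_epoch` and `valid_hits1` are hypothetical caller-supplied callbacks, and "begins to drop" is interpreted here as the first check whose score is lower than the previous check.

```python
def train_with_early_stop(train_epoch, valid_hits1, max_epochs=2000, check_every=10):
    """Train until validation Hits@1 drops or max_epochs is reached.

    train_epoch(epoch) trains one epoch; valid_hits1() returns the current
    Hits@1 on the validation set. Returns the last epoch that was run.
    """
    previous = None
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        if epoch % check_every == 0:
            current = valid_hits1()
            if previous is not None and current < previous:
                return epoch  # Hits@1 dropped since the last check
            previous = current
    return max_epochs


# Toy run: validation scores rise for three checks, then drop at the fourth.
scores = iter([0.10, 0.20, 0.30, 0.25])
trained_epochs = []
stopped_at = train_with_early_stop(trained_epochs.append, lambda: next(scores))
# stopped_at is 40: the fourth check (epoch 40) is the first one to drop.
```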

Besides, following common practice, each dataset is split into training, validation and test sets. The details are shown below.

# Ref. alignment	# Training	# Validation	# Test
15K	3,000	1,500	10,500
100K	20,000	10,000	70,000
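
The split sizes above follow from the 2:1:7 ratio applied to five disjoint training folds. The sketch below reproduces the arithmetic; the function name `five_fold_721` and the use of a seeded shuffle are assumptions, not a description of how the released folds were produced.

```python
import random


def five_fold_721(links, seed=0):
    """Split links into five (train, valid, test) triples with a 2:1:7 ratio."""
    links = list(links)
    random.Random(seed).shuffle(links)
    n = len(links)
    fold_size = n // 5    # 20% for training
    valid_size = n // 10  # 10% for validation
    splits = []
    for k in range(5):
        train = links[k * fold_size:(k + 1) * fold_size]
        rest = links[:k * fold_size] + links[(k + 1) * fold_size:]
        # The remaining 80% is divided into validation (10%) and test (70%).
        splits.append((train, rest[:valid_size], rest[valid_size:]))
    return splits


# 15,000 dummy links give 3,000 / 1,500 / 10,500 per fold.
splits = five_fold_721([(i, i) for i in range(15000)])
sizes = [tuple(len(part) for part in split) for split in splits]
```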

We use Hits@m (m = 1, 5, 10, 50), mean rank (MR) and mean reciprocal rank (MRR) as the evaluation metrics. Higher Hits@m and MRR scores as well as lower MR scores indicate better performance.
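
All three metrics are derived from the rank of the true counterpart in each test entity's candidate list. A minimal sketch, assuming 1-based ranks are already available (the function name `ranking_metrics` and the toy ranks are hypothetical):

```python
def ranking_metrics(ranks, ms=(1, 5, 10, 50)):
    """Compute Hits@m, MR and MRR from 1-based ranks of the true counterparts."""
    n = len(ranks)
    hits = {m: sum(1 for r in ranks if r <= m) / n for m in ms}
    mr = sum(ranks) / n                    # mean rank: lower is better
    mrr = sum(1.0 / r for r in ranks) / n  # mean reciprocal rank: higher is better
    return hits, mr, mrr


# Four test entities whose true counterparts are ranked 1st, 2nd, 10th and 100th.
hits, mr, mrr = ranking_metrics([1, 2, 10, 100])
```

In this toy case Hits@1 is 0.25, Hits@10 is 0.75, and MR is 28.25; the single rank of 100 dominates MR but barely affects MRR.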

Detailed Results

The detailed and supplementary experimental results are listed as follows:

Unexplored KG Embedding Models

Detailed results of unexplored KG embedding models on the 15K datasets

detailed_results_unexplored_models_15K.csv

Detailed results of unexplored KG embedding models on the 100K datasets

detailed_results_unexplored_models_100K.csv

License

This project is licensed under the GPL License; see the LICENSE file for details.

Citation

If you find the benchmark datasets, the OpenEA library or the experimental results useful, please kindly cite the following paper:

@article{OpenEA,
  author    = {Zequn Sun and
               Qingheng Zhang and
               Wei Hu and
               Chengming Wang and
               Muhao Chen and
               Farahnaz Akrami and
               Chengkai Li},
  title     = {A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {13},
  number    = {11},
  pages     = {2326--2340},
  year      = {2020},
  url       = {http://www.vldb.org/pvldb/vol13/p2326-sun.pdf}
}

If you use the DBP2.0 dataset, please kindly cite the following paper:

@inproceedings{DBP2,
  author    = {Zequn Sun and
               Muhao Chen and
               Wei Hu},
  title     = {Knowing the No-match: Entity Alignment with Dangling Cases},
  booktitle = {ACL},
  year      = {2021}
}
