lavis-nlp / irt

Open World KGC - IRT Dataset
MIT License
7 stars 1 forks source link

Inductive Reasoning with Text (IRT)

Code style: black PyPI version

:warning: Please consider using the successor dataset: IRT2 Repository

Table of Contents

This code is used to create benchmark datasets as described in Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery from a given knowledge graph (i.e. triple set) and supplementary text. The two KG's evaluated in the paper (based on FB15k237 and CoDEx) are available for download below.

Download

We offer two IRT reference datasets: The first - IRT-FB - is based on FB15k237 and the second - IRT-CDE - utilizes CoDEx. Each dataset offers knowledge graph triples for the closed world (cw) and open world (ow) split. The ow-split is partitioned into validation and test data. Each entity of the KG is assigned a set of text contexts of mentions of that entity.

Name Description Download
IRT-CDE Based on CoDEx Link
IRT-FB Based on FB15k237 Link

Installation

Python 3.9 is required. We recommend miniconda for managing Python environments.

conda create -n irt python=3.9
conda activate irt
pip install irt-data

The requirements.txt contains additional packages used for development.

Loading

Simply provide a path to an IRT dataset folder. The data is loaded lazily - that is why the construction is fast, but the first invocation of .description takes a while.

from irt import Dataset
dataset = Dataset('path/to/irt-fb')
print(dataset.description)
IRT DATASET

IRT GRAPH: irt-fb
  nodes: 14541
  edges: 310116 (237 types)
  degree:
    mean 42.65
    median 26

IRT SPLIT
2389 retained concepts

Config:
  seed: 26041992
  ow split: 0.7
  ow train split: 0.5
  relation threshold: 100
  git: 66fe7bd3c934311bdc3b1aa380b7c6c45fd7cd93
  date: 2021-07-21 17:29:04.339909

Closed World - TRAIN:
  owe: 12164
  entities: 12164
  heads: 11562
  tails: 11252
  triples: 238190

Open World - VALID:
  owe: 1558
  entities: 9030
  heads: 6907
  tails: 6987
  triples: 46503

Open World - TEST:
  owe: 819
  entities: 6904
  heads: 4904
  tails: 5127
  triples: 25423

IRT Text (Mode.CLEAN)
  mean contexts: 28.92
  median contexts: 30.00
  mean mentions: 2.84
  median mentions: 2.00

Data Formats

The data in the respective provided dataset folders should be quite self-explanatory. Each entity and each relation is assigned a unique integer id (denoted e [entity], h [head], t [tail], and r [relation]). There is folder containing the full graph data (graph/), a folder containing the open-world/closed-world splits (split/) and the textual data (text/).

Graph Data

This concerns both data in graph/ and split/. Entity and relation identifier can be translated with the graph/entities.txt and graph/relations.txt respectively. Triple sets come in h t r order. Reference code to load graph data:

Text Data

The upstream system that sampled our texts: ecc. All text comes gzipped and can be opened using the built-in python gzip library. For inspection, you can use the zcat, zless, zgrep, etc. (at least on unixoid systems ;)) - or extract them using unzip. Reference code to load text data:

PyKEEN Dataset

For users of pykeen. There are two "views" on the triple sets: closed-world and open-world. Both simply offer pykeen TriplesFactories with an id-mapping to the IRT entity-ids.

Closed-World:

from irt import Dataset
from irt import KeenClosedWorld

dataset = Dataset('path/to/dataset')

# 'split' is either a single float, a tuple (for an additional
# test split) or a triple which must sum to 1
kcw = KeenClosedWorld(dataset=dataset, split=.8, seed=1234)

print(kcw.description)
IRT PYKEEN DATASET
irt-cde

  training triples factory:
    entities: 12091
    relations: 51
    triples: 109910

  validation triples factory:
    entities: 12091
    relations: 51
    triples: 27478

It offers .training, .validation, and .testing TriplesFactories, and irt2keen/keen2irt id-mappings.

Open-World:

from irt import Dataset
from irt import KeenClosedWorld

dataset = Dataset('path/to/dataset')
kow = KeenOpenWorld(dataset=ds)

print(kow.description)
IRT PYKEEN DATASET
irt-cde

  closed world triples factory:
    entities: 12091
    relations: 51
    triples: 137388

  open world validation triples factory:
    entities: 15101
    relations: 46
    triples: 41240

  open world testing triples factory:
    entities: 17050
    relations: 48
    triples: 27577

It offers .closed_world, .open_world_valid, and .open_world_test TriplesFactories, and irt2keen/keen2irt id-mappings.

Pytorch Dataset

For users of pytorch and/or pytorch-lightning.

We offer a torch.utils.data.Dataset, a torch.utils.data.DataLoader and a pytorch_lightning.DataModule. The dataset abstracts what a "sample" is and how to collate samples to batches:

from irt import TorchDataset

# given you have loaded a irt.Dataset instance called "dataset"
# 'model_name' is one of huggingface.co/models
torch_dataset = TorchDataset(
    model_name='bert-base-cased',
    dataset=dataset,
    part=dataset.split.closed_world,
)

# a sample is an entity-to-token-index mapping:
torch_dataset[100]
# -> Tuple[int, List[int]]
# (124, [[101, 1130, ...], ...])

# and it offers a collator for batching:
batch = TorchDataset.collate_fn([torch_dataset[0], torch_dataset[1]])
# batch: Tuple[Tuple[int], torch.Tensor]

len(batch[0])   # -> 60
batch[1].shape  # -> 60, 105

Note: Only the first invocation is slow, because the tokenizer needs to run. The tokenized text is saved to the IRT folder under torch/ and re-used from then on.

Bring Your Own Data

If you want to utilize this code to create your own open-world/closed-world-split, you need to either bring your data in a format readable by the existing code base or extend this code for your own data model. See ipynb/graph.split.ipynb for a step-by-step guide.

Legacy Data

This data is used as upstream source or was used in the original experiments for the paper. They are left here for documentation and to allow for reproduction of the original results. You need to go back to this commit in irtm to use the data for model training.

Name Description Download
fb-contexts-v7 Original dataset (our text) as used in the paper (all modes, all context sizes) Link
fb-owe Original dataset (Wikidata descriptions provided by shah/OWE) Link
fb-db-contexts-v7 Our text sampled by ecc for FB Link
cde-contexts-v7 Original dataset (our text) as used in the paper (all modes, all contexts sizes) Link
cde-codex.en Original dataset (Texts provided by tsafavi/codex) Link
cde-db-contexts-v7 Our text sampled by ecc for CDE Link

Citation

If this is useful to you, please consider a citation:

@inproceedings{hamann2021open,
  title={Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery},
  author={Hamann, Felix and Ulges, Adrian and Krechel, Dirk and Bergmann, Ralph},
  booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
  pages={252--264},
  year={2021},
  organization={Springer}
}