TabbyLD2 is a web-based application for semantic annotation of relational tables and generation of facts from annotated tabular data to populate knowledge graphs.
Current version: 0.4
A source (input) table represents a set of entities of the same type in relational form (a subset of the Cartesian product of K data domains).
A table of same-type entities (a canonicalized form) is a relational table in third normal form (3NF) that contains an ordered set of N rows and M columns.
Assumption 1. The first row of a source table is a header containing attribute (column) names.
Assumption 2. All cell values within a column of a source table have the same entity type and data type.
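The two assumptions above can be checked programmatically. Below is a minimal sketch using only the Python standard library; the table content and the numeric-homogeneity heuristic are illustrative, not part of TabbyLD2:

```python
import csv
import io

# An illustrative source table: the first row is the header (Assumption 1),
# and all cells in a column share the same data type (Assumption 2).
RAW_TABLE = """Country,Capital,Population
Germany,Berlin,83200000
France,Paris,67750000
Italy,Rome,58850000
"""

def read_table(text):
    """Split a CSV source table into a header row and data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return header, data

def column_is_homogeneous(data, index):
    """Check Assumption 2 for one column: every cell is numeric, or none is."""
    cells = [row[index] for row in data]
    numeric = [cell.replace(".", "", 1).isdigit() for cell in cells]
    return all(numeric) or not any(numeric)

header, data = read_table(RAW_TABLE)
print(header)  # ['Country', 'Capital', 'Population']
print(all(column_is_homogeneous(data, i) for i in range(len(header))))  # True
```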
TabbyLD2 supports semantic interpretation (annotation) of individual elements of a source table against a target knowledge graph. DBpedia is used as the target knowledge graph.
A knowledge graph is a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities (Hogan et al. Knowledge Graphs. ACM Computing Surveys, 54(4), 2022, 1-37). A knowledge graph thus contains both entities (nodes) and the relations (edges) between them.
Semantic Table Interpretation (STI) is the process of recognizing tabular data and linking it with external concepts from a target knowledge graph. It includes three main tasks:

- Cell-Entity Annotation (CEA): linking cell values to entities of the knowledge graph;
- Column-Type Annotation (CTA): linking columns to classes (types) of the knowledge graph;
- Columns-Property Annotation (CPA): linking pairs of columns to properties (relations) of the knowledge graph.
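For example, interpreting a simple table of countries against DBpedia might produce links like the following. The URIs follow DBpedia's real naming scheme, but the mapping itself is an illustration, not actual TabbyLD2 output:

```python
# Illustrative STI result for a small table with "Country" and "Capital" columns.
# CEA links cell values to entities; CTA links a column to a class;
# CPA links a pair of columns to a property.
cea = {
    "Germany": "http://dbpedia.org/resource/Germany",
    "France": "http://dbpedia.org/resource/France",
}
cta = {"Country": "http://dbpedia.org/ontology/Country"}
cpa = {("Country", "Capital"): "http://dbpedia.org/ontology/capital"}

for cell, entity in cea.items():
    print(f"{cell} -> {entity}")
```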
First, clone the project into your directory:

```shell
git clone https://github.com/tabbydoc/tabbyld2.git
```

Next, install all requirements for this project:

```shell
pip install -r requirements.txt
```

We recommend Python 3.7 or higher.
In addition to SPARQL queries, we use DBpedia Lookup to find candidate entities from DBpedia. This service requires a separate installation.
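As an illustration of the SPARQL side of candidate generation, a label-based lookup against the public DBpedia endpoint could be built like this. This is a hedged sketch, not the query TabbyLD2 actually issues:

```python
# Build a SPARQL query that retrieves candidate DBpedia entities whose
# English rdfs:label matches a cell value. The query shape is illustrative;
# it can be sent to https://dbpedia.org/sparql with any SPARQL client.
def candidate_entity_query(mention, limit=10):
    """Return a SPARQL query string for candidate entities of a cell mention."""
    return (
        "SELECT DISTINCT ?entity WHERE { "
        '?entity rdfs:label "%s"@en . '
        "FILTER (!isBlank(?entity)) "
        "} LIMIT %d" % (mention, limit)
    )

query = candidate_entity_query("Berlin")
print(query)
```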
TabbyLD2 uses ColNet for the CTA task. ColNet is a framework based on Convolutional Neural Networks (CNNs) that predicts the most suitable (relevant) class from a set of candidates for each named-entity column. ColNet uses TensorFlow as its machine learning platform.
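Conceptually, once candidate classes for a column have been scored, the most relevant class is the one with the highest score. The sketch below shows only that final selection step; the scores and class URIs are invented for illustration, while ColNet itself derives such scores from CNN predictions:

```python
# Pick the most suitable class for a column from candidate scores.
# The scores here are made up; in ColNet they come from trained CNNs.
def select_column_class(candidate_scores):
    """Return the candidate class URI with the highest score."""
    return max(candidate_scores, key=candidate_scores.get)

scores = {
    "http://dbpedia.org/ontology/City": 0.91,
    "http://dbpedia.org/ontology/Place": 0.64,
    "http://dbpedia.org/ontology/Settlement": 0.77,
}
print(select_column_class(scores))  # http://dbpedia.org/ontology/City
```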
ColNet requirements:
NOTE 1: to use `sparql.query` in Python 3.7 and higher, go to the `sparql` library, find the `IRI` class, and change the return value of its `__str__()` method.

From:

```python
def __str__(self):
    return self.value.encode("unicode-escape")
```

to:

```python
def __str__(self):
    return self.value
```
NOTE 2: The versions of the TensorFlow and Keras libraries must be compatible with each other.
- `datasets` contains datasets of source tables for experimental evaluation:
  - `T2Dv2` contains the T2Dv2 Gold Standard dataset, where `col_class_checked_fg.csv` was formed by SemAIDA and is a fine-grained ground truth of classes for all columns;
  - `Tough_Tables` contains the Tough Tables (2T) dataset. NOTE: `CEA_2T_gt.zip` must be unzipped before running an experimental evaluation;
  - `GitTables_SemTab_2022` contains the GitTables dataset that was used in the SemTab 2022 competition for Column Type Annotation by DBpedia (GT-CTA-DBP);
- `examples` contains table examples in CSV format for testing;
- `experimental_evaluation` contains scripts for obtaining an experimental evaluation on the tables presented in the `datasets` directory;
- `results` contains table processing results (this directory is created by default);
- `source_tables` is the folder in which you need to place CSV files of source tables for processing (it contains two table files for testing by default);
- `tabbyld2` contains the TabbyLD2 modules, including `main.py` for console mode and `app.py` for web mode, and also:
  - `datamodel` contains descriptions of the tabular data and knowledge graph models;
  - `helpers` contains various utility functions for working with files, data, etc.;
  - `preprocessing` contains the table preprocessing module, which includes data cleaning, atomic column classification, and subject column identification;
  - `table_annotation` contains the semantic table annotator for the CEA and CTA tasks. This module also contains:
    - `colnet` contains the ColNet framework for annotating categorical columns (NE-columns);
    - `w2v_model` contains a pre-trained word2vec model. NOTE: this model is installed and placed independently.

To use TabbyLD2 in console mode, run the following command:
```shell
python main.py
```
Run this script to process source tables in CSV format. The tables must be located in the `source_tables` directory.
The processing results are presented in JSON format and saved to the `results` directory (`json` and `provenance` subdirectories).
To use TabbyLD2 in web mode, run the following command:

```shell
python app.py
```
NOTE: This mode does not work at the moment!