We present GTFS-Madrid-Bench, a benchmark to evaluate declarative KG construction engines that can be used to provide access mechanisms to (virtual) knowledge graphs. Our proposal introduces several scenarios that aim to measure the performance and scalability, as well as the query capabilities, of these engines, taking their heterogeneity into account. The data sources used in our benchmark are derived from the GTFS data files of the Madrid subway network. They can be transformed into several formats (CSV, JSON, SQL and XML) and scaled up. The query set aims to cover a representative number of SPARQL 1.1 features while including typical queries that data consumers may be interested in.
David Chaves-Fraga, Freddy Priyatna, Andrea Cimmino, Jhon Toledo, Edna Ruckhaus, & Oscar Corcho (2020). GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain. Journal of Web Semantics, 65.
Citing GTFS-Madrid-Bench: If you used GTFS-Madrid-Bench in your work, please cite as:
@article{chaves2020gtfs,
title={GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain},
author={Chaves-Fraga, David and Priyatna, Freddy and Cimmino, Andrea and Toledo, Jhon and Ruckhaus, Edna and Corcho, Oscar},
journal={Journal of Web Semantics},
volume={65},
pages={100596},
year={2020},
doi={10.1016/j.websem.2020.100596},
publisher={Elsevier}
}
Requirement: Docker must be installed locally.
Decide the distributions to be used for your testing. They can be:
Run the generator from the Docker image:
docker run --pull always -itv "$(pwd)":/output oegdataintegration/gtfs-bench
Alternatively, pull the image first:
docker pull oegdataintegration/gtfs-bench
and then run it:
docker run -itv "$(pwd)":/output oegdataintegration/gtfs-bench
A seed value can be provided to generate different changes, simulating multiple changed dumps. The following changes can be generated:
The result is available as result.zip in the current working directory. The archive contains two folders: one for the datasets and one for the queries (for virtual KGs). The datasets folder contains one folder per distribution (e.g., csv, sql, custom). Each distribution folder holds the requested sizes (one folder per size), the mapping associated with the distribution, and the SQL schemas where they are needed. To avoid repeating resources at every scale level, the data paths in the mappings and SQL files are defined at distribution level (e.g., "data/AGENCY.csv"), so adapting them for a correct evaluation is left to the user (with a script, for example). The utils folder provides some ideas on how to manage this. See the following example:
├── datasets
│ ├── csv
│ │ ├── 1
│ │ │ ├── AGENCY.csv
│ │ │ ├── CALENDAR.csv
│ │ │ ├── CALENDAR_DATES.csv
│ │ │ ├── FEED_INFO.csv
│ │ │ ├── FREQUENCIES.csv
│ │ │ ├── ROUTES.csv
│ │ │ ├── SHAPES.csv
│ │ │ ├── STOPS.csv
│ │ │ ├── STOP_TIMES.csv
│ │ │ └── TRIPS.csv
│ │ ├── 2
│ │ │ ├── AGENCY.csv
│ │ │ ├── CALENDAR.csv
│ │ │ ├── CALENDAR_DATES.csv
│ │ │ ├── FEED_INFO.csv
│ │ │ ├── FREQUENCIES.csv
│ │ │ ├── ROUTES.csv
│ │ │ ├── SHAPES.csv
│ │ │ ├── STOPS.csv
│ │ │ ├── STOP_TIMES.csv
│ │ │ └── TRIPS.csv
│ │ ├── 3
│ │ │ ├── AGENCY.csv
│ │ │ ├── CALENDAR.csv
│ │ │ ├── CALENDAR_DATES.csv
│ │ │ ├── FEED_INFO.csv
│ │ │ ├── FREQUENCIES.csv
│ │ │ ├── ROUTES.csv
│ │ │ ├── SHAPES.csv
│ │ │ ├── STOPS.csv
│ │ │ ├── STOP_TIMES.csv
│ │ │ └── TRIPS.csv
│ │ └── mapping.csv.nt
│ ├── json
│ │ ├── 1
│ │ │ ├── AGENCY.json
│ │ │ ├── CALENDAR_DATES.json
│ │ │ ├── CALENDAR.json
│ │ │ ├── FEED_INFO.json
│ │ │ ├── FREQUENCIES.json
│ │ │ ├── ROUTES.json
│ │ │ ├── SHAPES.json
│ │ │ ├── STOPS.json
│ │ │ ├── STOP_TIMES.json
│ │ │ └── TRIPS.json
│ │ ├── 2
│ │ │ ├── AGENCY.json
│ │ │ ├── CALENDAR_DATES.json
│ │ │ ├── CALENDAR.json
│ │ │ ├── FEED_INFO.json
│ │ │ ├── FREQUENCIES.json
│ │ │ ├── ROUTES.json
│ │ │ ├── SHAPES.json
│ │ │ ├── STOPS.json
│ │ │ ├── STOP_TIMES.json
│ │ │ └── TRIPS.json
│ │ ├── 3
│ │ │ ├── AGENCY.json
│ │ │ ├── CALENDAR_DATES.json
│ │ │ ├── CALENDAR.json
│ │ │ ├── FEED_INFO.json
│ │ │ ├── FREQUENCIES.json
│ │ │ ├── ROUTES.json
│ │ │ ├── SHAPES.json
│ │ │ ├── STOPS.json
│ │ │ ├── STOP_TIMES.json
│ │ │ └── TRIPS.json
│ │ └── mapping.json.nt
│ └── sql
│   ├── 1
│   │ ├── AGENCY.csv
│   │ ├── CALENDAR.csv
│   │ ├── CALENDAR_DATES.csv
│   │ ├── FEED_INFO.csv
│   │ ├── FREQUENCIES.csv
│   │ ├── ROUTES.csv
│   │ ├── SHAPES.csv
│   │ ├── STOPS.csv
│   │ ├── STOP_TIMES.csv
│   │ └── TRIPS.csv
│   ├── 2
│   │ ├── AGENCY.csv
│   │ ├── CALENDAR.csv
│   │ ├── CALENDAR_DATES.csv
│   │ ├── FEED_INFO.csv
│   │ ├── FREQUENCIES.csv
│   │ ├── ROUTES.csv
│   │ ├── SHAPES.csv
│   │ ├── STOPS.csv
│   │ ├── STOP_TIMES.csv
│   │ └── TRIPS.csv
│   ├── 3
│   │ ├── AGENCY.csv
│   │ ├── CALENDAR.csv
│   │ ├── CALENDAR_DATES.csv
│   │ ├── FEED_INFO.csv
│   │ ├── FREQUENCIES.csv
│   │ ├── ROUTES.csv
│   │ ├── SHAPES.csv
│   │ ├── STOPS.csv
│   │ ├── STOP_TIMES.csv
│   │ └── TRIPS.csv
│   ├── mapping.sql.nt
│   └── schema.sql
└── queries
  ├── q10.rq
  ├── q11.rq
  ├── q12.rq
  ├── q13.rq
  ├── q14.rq
  ├── q15.rq
  ├── q16.rq
  ├── q17.rq
  ├── q18.rq
  ├── q1.rq
  ├── q2.rq
  ├── q3.rq
  ├── q4.rq
  ├── q5.rq
  ├── q6.rq
  ├── q7.rq
  ├── q8.rq
  └── q9.rq
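Because the mappings and SQL files refer to the data at distribution level (e.g., "data/AGENCY.csv"), one way to prepare a run is to produce a copy whose paths point at a concrete size folder. The following is a minimal sketch of that idea, not the script shipped in the utils folder. The function name `localize_paths`, the assumption that every data path appears as a quoted string starting with `data/`, and the example triple are all illustrative.

```python
import re


def localize_paths(text: str, size_dir: str, prefix: str = "data/") -> str:
    """Point quoted, distribution-level data paths at one size folder.

    Assumption (not confirmed by the benchmark): each data path is a
    quoted string that starts with `prefix`, e.g. "data/AGENCY.csv" in a
    mapping or 'data/AGENCY.csv' in an SQL schema.
    """
    target = size_dir.rstrip("/") + "/"
    # Only match the prefix directly after a quote, so that IRIs that
    # merely contain "data/" (e.g. ".../metadata/...") stay untouched.
    pattern = re.compile(r"""(["'])""" + re.escape(prefix))
    return pattern.sub(lambda m: m.group(1) + target, text)


if __name__ == "__main__":
    # Illustrative N-Triples line; the real mappings may look different.
    triple = '_:s <http://semweb.mmlab.be/ns/rml#source> "data/AGENCY.csv" .'
    print(localize_paths(triple, "datasets/csv/2"))
```

The same function could be applied to the text of a mapping file and of schema.sql before each run, writing the result to a per-size copy so that the distribution-level originals stay unchanged.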
In addition to the generator engine, which provides the data at the desired scales and distributions together with the corresponding mappings and queries, there are also common resources openly available to be modified or used by any practitioner or developer:
Our experience testing (virtual) knowledge graph engines has revealed how difficult it is to set up an infrastructure that involves many variables and resources: databases, raw data, mappings, queries, data paths, mapping paths, database connections, etc. For that reason, and to make the benchmark easier to use for any developer or practitioner, we provide a set of utils, such as docker-compose templates and evaluation bash scripts, that in our opinion can reduce the time needed to prepare the testing setup.
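To illustrate how many combinations such a setup involves, the following sketch enumerates one run per distribution, size, and query from the unpacked result layout. It is not one of the provided utils. The function name `enumerate_runs` and the assumption that the size folders are the only subdirectories of each distribution folder are illustrative.

```python
from pathlib import Path
from typing import Iterator, Tuple


def enumerate_runs(root: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (distribution, size, query file) for every combination under root.

    Assumes the layout of the unpacked result archive: root/datasets holds
    one folder per distribution, each with one subfolder per size, and
    root/queries holds the .rq files.
    """
    queries = sorted(p.name for p in (root / "queries").glob("*.rq"))
    distributions = sorted(p for p in (root / "datasets").iterdir() if p.is_dir())
    for dist in distributions:
        # Files such as the mapping or schema.sql are skipped by is_dir().
        for size in sorted(p for p in dist.iterdir() if p.is_dir()):
            for query in queries:
                yield dist.name, size.name, query
```

Each yielded tuple identifies one evaluation run, so an evaluation script can loop over them, prepare the data paths for that size, and record the measurements per run.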
Moreover, the utils folder contains a series of scripts for evaluating Façade-based data access engines (e.g., SPARQL Anything).
We highly recommend that KG construction engines (virtualizers or materializers) tested with this benchmark provide at least the following metrics:
For virtual knowledge graph systems, we also encourage developers and testers to provide:
*R package available at https://github.com/dachafra/dief (an extension of https://github.com/maribelacosta/dief) and Python PyPI module available at https://pypi.org/project/diefpy/ (provided by SDM-TIB)
All the datasets generated by this benchmark have to follow the license of the Consorcio Regional de Transporte de Madrid: https://www.crtm.es/licencia-de-uso?lang=en
We know that there are variables and dimensions that we did not take into account in the current version of the benchmark (e.g., transformation functions defined in the mapping rules). If you are interested in collaborating with us on a new version of the benchmark, send us an email or open a new discussion!
Ontology Engineering Group, October 2019 - Present