ODIN is a benchmark for data extraction solutions that produce structured data. It is designed to evaluate the backend of such solutions (especially the acquisition phase) by simulating the ingestion, storage and retrieval of streams of RDF data. To this end, ODIN emulates the load faced by a triple store while an extraction solution for enterprise data (e.g., industry sensors) inserts triples, based on models derived from real data. The key performance indicators of the evaluation are completeness and efficiency.
Guidelines on how to upload a benchmark can be found here: https://github.com/hobbit-project/platform/wiki/Benchmark-your-system
If you want to run ODIN using the platform, please follow the guidelines found here: https://github.com/hobbit-project/platform/wiki/Experiments
The current docker files can be found here: https://github.com/hobbit-project/odin/tree/master/docker
(Note that the benchmark must be built before the Docker images can be created.) ODIN consists of four basic components: the OdinBenchmarkController, the OdinDataGenerator, the OdinTaskGenerator and the OdinEvaluationModule.
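As a rough, hedged sketch of that build step, assuming a standard Maven setup (the exact goals or options may differ in your environment), building the benchmark jar referenced by the Dockerfiles below could look like this:

# Clone the repository and build the benchmark jar; the Dockerfiles below
# expect it at target/odin-1.0.0-SNAPSHOT.jar.
git clone https://github.com/hobbit-project/odin.git
cd odin
mvn clean package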
If a user wants to create Docker images for the OdinBenchmarkController, the OdinEvaluationModule and the OdinTaskGenerator, he/she must use the following Dockerfile:
FROM java
ADD target/odin-1.0.0-SNAPSHOT.jar /odin/odin.jar
WORKDIR /odin
CMD java -cp odin.jar org.hobbit.core.run.ComponentStarter org.hobbit.odin.odintaskgenerator.X
where X is the name of the corresponding ODIN component (with the package adjusted accordingly, e.g. org.hobbit.odin.odintaskgenerator.OdinTaskGenerator for the Task Generator). This Dockerfile uses the official java image as its base, copies the benchmark jar into the container as /odin/odin.jar, sets /odin as the working directory and starts the selected component via the HOBBIT ComponentStarter.
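As an illustrative sketch only, building one of these images from the repository root could look like the following; the Dockerfile name and image tag are placeholders, so adapt them to the files actually provided in the docker/ directory linked above:

# Build the Benchmark Controller image (the file name
# odinbenchmarkcontroller.docker and the tag are illustrative placeholders).
docker build -f docker/odinbenchmarkcontroller.docker -t odin-benchmarkcontroller .
# Repeat with the corresponding Dockerfiles for the OdinEvaluationModule
# and the OdinTaskGenerator.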
If the user wants to create a Docker image for the OdinDataGenerator, he/she must use the following Dockerfile:
FROM maven:3.3.9-jdk-8
ADD target/odin-1.0.0-SNAPSHOT.jar /odin/odin.jar
ADD scripts/download.sh /odin/download.sh
WORKDIR /odin
CMD java -cp odin.jar org.hobbit.core.run.ComponentStarter org.hobbit.odin.odindatagenerator.OdinDataGenerator
which is the same as the previous example apart from the base image (maven:3.3.9-jdk-8 instead of java) and the line ADD scripts/download.sh /odin/download.sh. This line adds the script download.sh (included in the repository) to the container's working directory /odin/, so that the user can run ODIN using the TWIG mimicking algorithm.
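Again as a hedged sketch, with a placeholder Dockerfile name and image tag, the Data Generator image could be built in the same way:

# Build the Data Generator image (file name and tag are illustrative placeholders).
docker build -f docker/odindatagenerator.docker -t odin-datagenerator .

The resulting images then need to be made available to the HOBBIT platform as described in the benchmark-upload guidelines linked above.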
When configuring an experiment, the following parameters can be set. Duration of the benchmark: The user must determine the duration of the benchmark by assigning a value in milliseconds to this field. The default value is set to 600,000 ms (10 minutes). Note that the duration of each experiment is at most 40 minutes.
Name of mimicking algorithm output folder: The relative path of the folder into which the mimicking algorithm writes the generated dataset. The default value is set to output_data/.
Number of insert queries per stream: This value determines the number of INSERT SPARQL queries after which a SELECT query is performed. The default value is set to 100.
Population of generated data: This value determines the number of events generated by a mimicking algorithm for one Data Generator. Note that this value might not be equal to the number of generated triples. The default value is set to 1000.
Number of data generators - agents: The number of Data Generators for this experiment. The default value is 2.
Name of mimicking algorithm: The name of the mimicking algorithm to be invoked to generate data. There are two available values: TRANSPORT_DATA (https://github.com/PoDiGG/podigg), which invokes the mimicking algorithm developed by imec for public transport data, and TWIG (https://github.com/AKSW/TWIG), which invokes the mimicking algorithm for Twitter messages. The default value is TRANSPORT_DATA.
Seed for mimicking algorithm: The seed value for a mimicking algorithm. The default value is 100.
Number of task generators - agents: The number of Task Generators for this experiment. The default value is 1.