herminiogg / ShExML

A heterogeneous data mapping language based on Shape Expressions
http://shexml.herminiogarcia.com
MIT License

ShExML

Shape Expressions Mapping Language (ShExML) is a domain-specific language (DSL) for mapping and merging heterogeneous data sources. Because it is based on ShEx, the shape is the central construct for defining the transformations.

Example

PREFIX : <http://example.com/>
SOURCE films_xml_file <https://rawgit.com/herminiogg/ShExML/master/src/test/resources/films.xml>
SOURCE films_json_file <https://rawgit.com/herminiogg/ShExML/master/src/test/resources/films.json>
ITERATOR film_xml <xpath: //film> {
    FIELD id <@id>
    FIELD name <name>
    FIELD year <year>
    FIELD country <country>
    FIELD directors <directors/director>
}
ITERATOR film_json <jsonpath: $.films[*]> {
    FIELD id <id>
    FIELD name <name>
    FIELD year <year>
    FIELD country <country>
    FIELD directors <director>
}
EXPRESSION films <films_xml_file.film_xml UNION films_json_file.film_json>

:Films :[films.id] {
    :name [films.name] ;
    :year [films.year] ;
    :country [films.country] ;
    :director [films.directors] ;
}

This example shows how to map and merge two files (one XML, one JSON) that describe different films. The first part, the declarations, defines 'variables' that can be used inside the shapes: prefixes used in the resulting RDF, sources pointing to the files, iterators and fields (queries) applied over those files, and expressions that merge and transform the query results. The shapes are then defined as in ShEx, either using the previously defined expressions or composing them inside the square brackets. A more complex example can be found in the films.shexml file.

Features

The full specification, with all the supported features and examples, is available online.

Usage

CLI

The JAR distribution includes a command-line interface with the following options:

Usage: ShExML [-hrsV] [-id] [-nu] [-rp] [-sh] [-shc] [-sm] [-d=<drivers>]
              [-f=<format>] -m=<file> [-o=<output>] [-p=<password>]
              [-u=<username>]
Map and merge heterogeneous data sources with a Shape Expressions based syntax
  -d, --drivers=<drivers>    Add more JDBC database drivers in the form of
                               <startJDBCURL>%<driver> and separating them with
                               ";". Example: jdbc:postgresql%org.postgresql.
                               Driver;jdbc:oracle%oracle.jdbc.OracleDriver
  -f, --format=<format>      Output format for RDF graph. Turtle, RDF/XML,
                               N-Triples, ...
  -h, --help                 Show this help message and exit.
      -id, --inferenceDatatypes
                             Use the inference system for choosing the best
                               suited datatype for the generated literal.
                               Without this option, and not declaring a
                               datatype in the mapping rules, all the literals
                               will be outputted as strings
  -m, --mapping=<file>       Path to the file with the mappings
      -nu, --normaliseURIs   Activate the URI normalisation system which allows
                               to avoid malformed URIs when using strings for
                               URI creation
  -o, --output=<output>      Path where the output file should be created
  -p, --password=<password>  Password in case of using a database
  -r, --rml                  Generate RML output
      -rp, --rmlPrettified   Generate RML output using Blank nodes for better
                               readability
  -s, --shex                 Generate ShEx validation
      -sh, --shacl           Generate SHACL validation
      -shc, --shaclClosed    Generate SHACL validation with closed shapes as
                               default
      -sm, --shapeMap        Generate Shape Map for ShEx validation
  -u, --username=<username>  Username in case of using a database
  -V, --version              Print version information and exit.

For example, to run the films example:

$ java -jar shexml.jar -m films.shexml
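The options from the help text can be combined in a single call. The sketch below is only an illustration: it assumes the JAR is named shexml.jar and sits in the current directory, and films.ttl is a hypothetical output file name. The format name Turtle is taken from the help text above.

```shell
# Assumed file names; adjust them to your setup.
JAR=shexml.jar          # assumed name and location of the ShExML JAR
MAPPING=films.shexml    # mapping file passed with -m
OUTPUT=films.ttl        # hypothetical output path passed with -o

# Run the mapping only when the JAR is actually present.
if [ -f "$JAR" ]; then
  # -m: mapping file, -f: RDF output format, -o: output file
  java -jar "$JAR" -m "$MAPPING" -f Turtle -o "$OUTPUT"
else
  echo "JAR not found: $JAR" >&2
fi
```

Without -o the films example above is invoked with the mapping file only, so -o is useful when the result should end up in a specific file.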

JVM compatible API

ShExML is written in Scala, so it can be used from any JVM-compatible language. The example below shows how to use the programmatic API.

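// Assumed package name; adjust the import if it does not resolve.
import com.herminiogarcia.shexml.MappingLauncher
// Read the mapping file (pathToFile is the path to a .shexml file).
val file = scala.io.Source.fromFile(pathToFile).mkString
val mappingLauncher = new MappingLauncher()
val output = mappingLauncher.launchMapping(file, "TURTLE")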

Requirements

The minimal versions for this software to work are:

Webpage

A live playground is also offered online (http://shexml.herminiogarcia.com). However, due to hardware limitations it is not intended for intensive use.

Citation

This tool is part of a scientific project which has led to different publications. The main and preferred publication for citation is:

García-González, H., Boneva, I., Staworko, S., Labra-Gayo, J. E., & Lovelle, J. M. C. (2020). 
ShExML: improving the usability of heterogeneous data mapping languages for first-time users. 
PeerJ Computer Science, 6, e318. https://doi.org/10.7717/peerj-cs.318

Other possible publications per topic are:

Build

The library uses sbt as its package manager and build tool. To compile the project, run the following command:

$ sbt compile

To run the project from within sbt, use the command below, where <options> stands for the arguments explained in the CLI section.

$ sbt "run <options>"

To generate an executable JAR file, run the command below. The "set test in assembly := {}" option skips the tests during assembly. If you want to run the tests before generating the artifact, set up the testing environment as explained in the Testing section and omit that option from the command.

$ sbt "set test in assembly := {}" clean update assembly

Testing

The project contains a full test suite that checks that all the features included in the engine work as expected. These unit tests are located under the src/test/scala folder. To run them, use the command below. It is essential that the project passes the tests for all the cross-compiled versions used within the project (see the Cross-compilation section for more details).

$ sbt test

The tests rely on some external resources that need to be set up before running them. This mainly involves starting a MySQL and a PostgreSQL database, creating the relational schema and filling the tables with dummy data. This process is described in the GitHub workflow file.

Cross-compilation

The project is cross-compiled against three Scala versions (2.12.x, 2.13.x and 3.x) so that it can be used across different Scala environments. By default, all commands run against the 3.x version, but the same command can also be run for all versions at once or for one specific version. The test command is used below as an example.

Testing against all the cross-compiled versions:

$ sbt "+ test"

Testing against a specific version, where <version> is one of the versions configured in the build.sbt file:

$ sbt "++<version> test"
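As a concrete illustration of the version switch, the sketch below substitutes a placeholder version number. The value 2.13.12 is hypothetical and is not taken from the project; replace it with one of the versions actually listed in build.sbt.

```shell
# Hypothetical version; replace with one configured in build.sbt.
SCALA_VERSION=2.13.12

# Run only from the project root and only when sbt is installed.
if [ -f build.sbt ] && command -v sbt >/dev/null 2>&1; then
  sbt "++${SCALA_VERSION} test"
else
  echo "Run this from the project root with sbt installed" >&2
fi
```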

Dependencies

The following dependencies are used by this library:

Dependency                                  License
org.antlr / antlr4                          BSD-3-Clause
net.sf.saxon / Saxon-HE                     MPL-2.0
org.apache.jena / jena-base                 Apache License 2.0
org.apache.jena / jena-core                 Apache License 2.0
org.apache.jena / jena-arq                  Apache License 2.0
org.apache.jena / jena-shacl                Apache License 2.0
info.picocli / picocli                      Apache License 2.0
org.slf4j / slf4j-nop                       MIT License
com.github.tototoshi / scala-csv            Apache License 2.0
org.xerial / sqlite-jdbc                    Apache License 2.0
mysql / mysql-connector-java                GPL-v2 (Universal FOSS Exception v1)
org.postgresql / postgresql                 BSD-2-Clause
org.mariadb.jdbc / mariadb-java-client      LGPL-2.1
com.microsoft.sqlserver / mssql-jdbc        MIT License
com.github.vickumar1981 / stringdistance    Apache License 2.0
com.typesafe.scala-logging / scala-logging  Eclipse Public License v1.0 or LGPL-2.1
com.jayway.jsonpath / json-path             Apache License 2.0
org.scala-lang / scala-reflect              Apache License 2.0
org.scala-lang / scala-compiler             Apache License 2.0

For a more exhaustive license check, including subdependencies and test dependencies, the sbt-license-report plugin is included in the project. It generates a report with the command:

$ sbt dumpLicenseReport

After running this command, the results are available under the directory target/license-reports.