Introduction. The Shapes Constraint Language (SHACL) is a formal language for validating Resource Description Framework (RDF) graphs against a set of conditions, which are themselves expressed in RDF. SHACL validates only RDF graphs. Following the same idea and implementing a subset of the language, MQAF API provides a mechanism for defining SHACL-like rules for data sources in non-RDF formats such as XML, CSV and JSON. The rules can be defined either in YAML or JSON configuration files or in Java code. MQAF API has already been validated in different organisations (Flemish Audio-Visual Archives, Victoria and Albert Museum, Deutsche Digitale Bibliothek). In this proposed research we extend this ruleset to make it applicable to MARC records. Making it available for MARC records has two requirements:
Supporting an addressing scheme that fits MARC records. Such a scheme is similar to XPath or JSONPath, which are mechanisms for precisely selecting a part of an XML or JSON document. For MARC there is a proposal, Carsten Klee's MARCspec - a common MARC record path language, which is already partly supported by QA catalogue.
Implementing a particular interface of MQAF API that returns a unified value object when it is given the address of a data element.
In this research we have two control data sets. BL provided a sample with examples where particular problems were caught by an alternative tool, but not by QA catalogue. KBR developed an XSLT-based solution to check local rulesets. Both can be used as a baseline against which to compare our results.
During the research students will learn about the following technologies: MARC, SHACL, XPath, JSONPath, MARCspec.
Research questions and tasks (Computer Science):
Finish the implementation of MARCspec in QA catalogue
Create a selector class that finds and retrieves the part of the record addressed by a MARCspec expression (other implementations of the interface are already available for XPath, JSONPath and SQL column name expressions)
Adapt MQAF API so that it accepts this selector implementation as an input parameter (right now the implementations are hard coded)
Create a command line interface in QA catalogue that accepts a configuration file describing a SHACL-like ruleset
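
The second and third tasks above can be illustrated with a minimal Java sketch. It is not the MQAF API interface itself: the names RecordSelector, MapBackedMarcSelector and MinCountChecker are hypothetical, and the record is reduced to a map from address to values. It only shows the idea of passing the selector in as a parameter instead of hard coding it.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical selector contract: resolves an address against a record of type T. */
interface RecordSelector<T> {
  List<String> select(T record, String address);
}

/**
 * Toy stand-in for a MARCspec-based selector (an assumption for illustration).
 * The record is simply a map from an address such as "245$a" to its values.
 */
class MapBackedMarcSelector implements RecordSelector<Map<String, List<String>>> {
  @Override
  public List<String> select(Map<String, List<String>> record, String address) {
    return record.getOrDefault(address, List.of());
  }
}

/** Hypothetical rule runner that receives the selector through its constructor. */
class MinCountChecker<T> {
  private final RecordSelector<T> selector;

  MinCountChecker(RecordSelector<T> selector) {
    this.selector = selector;
  }

  /** True if the address yields at least minCount values in the record. */
  boolean hasAtLeast(T record, String address, int minCount) {
    return selector.select(record, address).size() >= minCount;
  }
}
```

With this shape, supporting a new format would only require another RecordSelector implementation, while the rule logic stays untouched.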
Research questions (LIS / Humanities):
How can the content-specific cataloguing rules provided by BL and KBR be transformed into SHACL-like machine-readable rules? Are there limitations, and if yes, what is their nature?
What suggestions might this research offer to the SHACL community and for the further development of QA catalogue?
Comparing the results of the KBR and QA catalogue approaches, what suggestions might this research offer to both parties?
In the context of this task only a limited subset of MARCspec can be used, namely the one implemented in the MarcSpec class. The most important components of the address are:
a field: <field tag>, e.g. LDR (Leader) or 245
a part of a field value by character positions: <field>/<start>-<end>, e.g. LDR/0-4 (the first 5 characters of the Leader)
a subfield of a data field: <field>$<code>, e.g. 245$a
an indicator of a data field: <field>_1 or <field>_2, e.g. 880_1
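
To make the four address forms concrete, here is a minimal Java sketch of a parser for them. It is not the MarcSpec class of QA catalogue: the class SimpleMarcAddress and its regular expression are assumptions made for illustration, and the tag pattern is deliberately simplified. Besides <start>-<end>, it also accepts a single position such as LDR/18.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the address forms listed above (hypothetical class). */
final class SimpleMarcAddress {
  enum Kind { FIELD, POSITIONS, SUBFIELD, INDICATOR }

  // A three-character tag, optionally followed by exactly one of:
  // /start or /start-end, $code, _1 or _2.
  private static final Pattern ADDRESS = Pattern.compile(
      "^([0-9A-Za-z]{3})(?:/(\\d+)(?:-(\\d+))?|\\$([0-9a-z])|_([12]))?$");

  final Kind kind;
  final String tag;
  final int start;       // -1 unless kind == POSITIONS
  final int end;         // inclusive; -1 unless kind == POSITIONS
  final String subfield; // null unless kind == SUBFIELD
  final int indicator;   // 0 unless kind == INDICATOR

  private SimpleMarcAddress(Kind kind, String tag, int start, int end,
                            String subfield, int indicator) {
    this.kind = kind;
    this.tag = tag;
    this.start = start;
    this.end = end;
    this.subfield = subfield;
    this.indicator = indicator;
  }

  static SimpleMarcAddress parse(String input) {
    Matcher m = ADDRESS.matcher(input);
    if (!m.matches()) {
      throw new IllegalArgumentException("Unsupported address: " + input);
    }
    String tag = m.group(1);
    if (m.group(2) != null) {
      int start = Integer.parseInt(m.group(2));
      int end = m.group(3) == null ? start : Integer.parseInt(m.group(3));
      if (end < start) {
        throw new IllegalArgumentException("End before start: " + input);
      }
      return new SimpleMarcAddress(Kind.POSITIONS, tag, start, end, null, 0);
    }
    if (m.group(4) != null) {
      return new SimpleMarcAddress(Kind.SUBFIELD, tag, -1, -1, m.group(4), 0);
    }
    if (m.group(5) != null) {
      return new SimpleMarcAddress(Kind.INDICATOR, tag, -1, -1, null,
          Integer.parseInt(m.group(5)));
    }
    return new SimpleMarcAddress(Kind.FIELD, tag, -1, -1, null, 0);
  }

  /** Cuts the addressed character range out of a field value (end is inclusive). */
  String slice(String fieldValue) {
    if (kind != Kind.POSITIONS) {
      throw new IllegalStateException("Not a character position address");
    }
    return fieldValue.substring(start, end + 1);
  }
}
```

For example, SimpleMarcAddress.parse("LDR/0-4").slice(leader) would return the first five characters of a Leader string, because the end position is treated as inclusive.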
The MarcSpec object's constructor expects a MARCspec-conformant string as input, such as new MarcSpec("245$a"), and the BibliographicRecord class has a select(MarcSpec selector) method which returns a list of strings (List<String>). This way it is relatively easy to fetch information from the MARC record; however, this output is not well suited as the input of SHACL validation.
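
One possible way to bridge this gap is to wrap each selected string in a small value object. The sketch below is only an assumption about what such an object could look like: AddressedValue is a hypothetical class, not the unified value object of MQAF API, and it assumes that a rule benefits from knowing the address and the position of each value within the selection result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical value object: one selected value plus where it came from. */
final class AddressedValue {
  final String address; // the address used for the selection, e.g. "245$a"
  final int occurrence; // zero-based position within the selection result
  final String value;

  AddressedValue(String address, int occurrence, String value) {
    this.address = address;
    this.occurrence = occurrence;
    this.value = value;
  }

  /** Wraps the plain strings returned by a select(...)-style call. */
  static List<AddressedValue> wrap(String address, List<String> rawValues) {
    List<AddressedValue> wrapped = new ArrayList<>();
    for (int i = 0; i < rawValues.size(); i++) {
      wrapped.add(new AddressedValue(address, i, rawValues.get(i)));
    }
    return Collections.unmodifiableList(wrapped);
  }

  @Override
  public String toString() {
    return address + "[" + occurrence + "]=" + value;
  }
}
```

Note that the occurrence index here counts positions in the returned list; it does not necessarily identify the field instance the value came from, which a real implementation might need to track separately.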
Example use cases:
040$a is mandatory (should have at least one occurrence in every record), and has to contain "BE-KBR00"
041$a is mandatory
041$b is optional
245$6 is optional. If used, it should only contain "880-0X" with X being a digit
If the record follows ISBD punctuation (LDR/18 = "i") and a colon (":") is present in the field 245$a (title), 245$b (remainder of title) is mandatory
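
The use cases above can be written down as plain Java checks to show what the SHACL-like rules have to express: minimum count, a required value, a pattern, and a conditional dependency. This is a sketch under stated assumptions, not MQAF API code. The class KbrUseCaseChecks is hypothetical, the record is reduced to a map from address to values, "has to contain" is read as "at least one occurrence equals the value", and the Leader position is addressed as LDR/18.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical hand-written version of the example use cases. */
final class KbrUseCaseChecks {
  // "880-0X" with X being a digit
  private static final Pattern LINKAGE = Pattern.compile("880-0\\d");

  /** Returns one message per violated rule; an empty list means the record passes. */
  static List<String> validate(Map<String, List<String>> record) {
    List<String> violations = new ArrayList<>();

    // 040$a: mandatory, and one occurrence must be "BE-KBR00" (assumed reading)
    List<String> agency = values(record, "040$a");
    if (agency.isEmpty()) {
      violations.add("040$a is mandatory");
    } else if (!agency.contains("BE-KBR00")) {
      violations.add("040$a has to contain BE-KBR00");
    }

    // 041$a: mandatory; 041$b is optional, so there is nothing to check for it
    if (values(record, "041$a").isEmpty()) {
      violations.add("041$a is mandatory");
    }

    // 245$6: optional, but every occurrence must match the pattern
    for (String linkage : values(record, "245$6")) {
      if (!LINKAGE.matcher(linkage).matches()) {
        violations.add("245$6 has an unexpected value: " + linkage);
      }
    }

    // Conditional rule: ISBD punctuation and a colon in 245$a require 245$b
    boolean isbd = values(record, "LDR/18").contains("i");
    boolean colonInTitle = values(record, "245$a").stream().anyMatch(t -> t.contains(":"));
    if (isbd && colonInTitle && values(record, "245$b").isEmpty()) {
      violations.add("245$b is mandatory when 245$a contains a colon");
    }
    return violations;
  }

  private static List<String> values(Map<String, List<String>> record, String address) {
    return record.getOrDefault(address, List.of());
  }
}
```

The last check is the interesting one for the research questions: it depends on three different addresses at once, so a SHACL-like configuration needs some way to combine conditions across data elements.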
Tasks