pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

SHACL for bibliographic records #209

Open pkiraly opened 1 year ago

pkiraly commented 1 year ago

Introduction. The Shapes Constraint Language (SHACL) is a formal language for validating Resource Description Framework (RDF) graphs against a set of conditions, which are themselves expressed in RDF. Following this idea and implementing a subset of the language, MQAF API provides a mechanism to define SHACL-like rules for data sources in non-RDF formats, such as XML, CSV and JSON (SHACL itself validates only RDF graphs). The rules can be defined either in YAML or JSON configuration files or in Java code. MQAF API has already been validated in different organisations (Flemish Audio-Visual Archives, Victoria and Albert Museum, Deutsche Digitale Bibliothek). In this proposed research we extend this ruleset to be applicable to MARC records. Making it available for MARC records has two requirements:

In this research we have two control data sets. BL provided a sample of records in which an alternative tool caught particular problems that QA catalogue did not. KBR developed an XSLT-based solution to check local rulesets. We can compare the results of both with ours.

During the research students will learn about the following technologies: MARC, SHACL, XPath, JSONPath, MARCspec.

Research questions and tasks (Computer Science):

Research questions (LIS / Humanities):

Potential partners:

References:

In the context of this task only a limited subset of MARCspec can be used, namely the one implemented in the MarcSpec class. The most important components of the address are:

The MarcSpec object's constructor expects a MARCspec-conformant string as input, such as new MarcSpec("245$a"), and the BibliographicRecord class has a select(MarcSpec selector) method, which returns a list of strings (List<String>). This makes it relatively easy to fetch information from a MARC record; however, this output is not well suited as input for SHACL validation.
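To illustrate the shape of that output, here is a minimal sketch in Java. ToySubfield, ToyField and ToyRecord are hypothetical stand-ins written for this illustration, not the BibliographicRecord or MarcSpec classes of QA catalogue, and the address handling only covers the tag$code form. It shows one possible reading of the limitation (an assumption, not something stated above): the flat list no longer tells which field instance a value came from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a MARC subfield (code + value).
class ToySubfield {
  final char code;
  final String value;

  ToySubfield(char code, String value) {
    this.code = code;
    this.value = value;
  }
}

// Hypothetical stand-in for one instance of a MARC data field.
class ToyField {
  final String tag;
  final List<ToySubfield> subfields = new ArrayList<>();

  ToyField(String tag) {
    this.tag = tag;
  }

  ToyField with(char code, String value) {
    subfields.add(new ToySubfield(code, value));
    return this;
  }
}

// Hypothetical stand-in for a record; NOT the BibliographicRecord class.
class ToyRecord {
  final List<ToyField> fields = new ArrayList<>();

  ToyRecord add(ToyField field) {
    fields.add(field);
    return this;
  }

  // Accepts only addresses of the form "245$a":
  // a three-character tag, '$', and a one-character subfield code.
  List<String> select(String address) {
    String tag = address.substring(0, 3);
    char code = address.charAt(4);
    List<String> result = new ArrayList<>();
    for (ToyField field : fields) {
      if (!field.tag.equals(tag)) {
        continue;
      }
      for (ToySubfield subfield : field.subfields) {
        if (subfield.code == code) {
          // The value is kept, its field instance is not.
          result.add(subfield.value);
        }
      }
    }
    return result;
  }
}
```

With two 650 fields that each carry one $a, select("650$a") in this sketch returns both values in a single list, so a rule that needs to reason per field instance has nothing to hold on to.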

Example use cases:

Tasks

pkiraly commented 1 year ago
- name: 040$a
  path: 040$a
  rules:
  - minCount: 1
  - pattern: ^BE-KBR00
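
A rough sketch of how the two rules above could be evaluated against the values selected from 040$a. RuleSketch and its methods are hypothetical names, not MQAF API classes. Two assumptions are made: pattern is treated as a partial match in the way SHACL's sh:pattern is (so the ^ anchor matters), and pattern passes when there are no values at all, leaving the empty case to minCount.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MQAF API or QA catalogue.
class RuleSketch {

  // minCount: at least `min` values must be present.
  static boolean minCount(List<String> values, int min) {
    return values.size() >= min;
  }

  // pattern: every value must match the regex somewhere (find(), not matches()).
  // Assumption: an empty list passes, since there is no value to violate the rule.
  static boolean pattern(List<String> values, String regex) {
    Pattern compiled = Pattern.compile(regex);
    for (String value : values) {
      if (!compiled.matcher(value).find()) {
        return false;
      }
    }
    return true;
  }

  // Applies the two rules of the 040$a example and returns the names of failed rules.
  static List<String> check040a(List<String> values) {
    List<String> failed = new ArrayList<>();
    if (!minCount(values, 1)) {
      failed.add("minCount");
    }
    if (!pattern(values, "^BE-KBR00")) {
      failed.add("pattern");
    }
    return failed;
  }
}
```

In this sketch a record without 040$a fails only minCount, and a record whose 040$a is, say, "DLC" fails only pattern.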