lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

marc2bibframe2

XSLT-based conversion from MARCXML to BIBFRAME 2.0

Introduction

This repository contains an XSLT 1.0 application for converting MARCXML records to RDF/XML, using the BIBFRAME 2.0 and MADSRDF ontologies. The expected input is a MARCXML record or collection, and the output is an XML document expressing the data as a set of RDF triples in the striped RDF/XML syntax. In addition, there is a sample configuration for the Metaproxy search gateway server from Index Data, showing the integration of the application with Metaproxy to provide both a "static" conversion of MARC records and an "active" conversion that attempts to resolve identifiers for configured entities.

The specification for the conversion has been published by the Library of Congress at http://www.loc.gov/bibframe/mtbf/.

Using the converter

In the simplest case, you can invoke an XSLT processor with the main stylesheet (xsl/marc2bibframe2.xsl) as the first argument, and an XML file containing MARCXML as the second:

xsltproc xsl/marc2bibframe2.xsl test/data/marc.xml

Preprocessing (new as of Oct 2023)

An optional preprocessing step attempts to split an individual MARC record into multiple MARC records, with the additional records representing different Instances of the same Work described in the original (source) MARC record. It takes a MARCXML record as input and outputs a marc:collection of one or more MARC records. See the Preprocess 0 document in the spec/ directory. Like the main stylesheet, it can be invoked with an XSLT processor:

xsltproc xsl/ConvSpec-Preprocess0-Splitting.xsl test/data/marc.xml
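
Because the preprocessing step outputs a marc:collection, it can be useful to check how many records a split produced before converting them. The following is a minimal sketch using only the Python standard library. The function name list_record_ids and the sample collection are hypothetical and hand-written for illustration; the sample is not real output from the stylesheet.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML ("MARC21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def list_record_ids(collection_xml):
    """Return the 001 value of every marc:record that is a direct child of the root."""
    root = ET.fromstring(collection_xml)
    ids = []
    for record in root.findall("marc:record", MARC_NS):
        # findtext returns None when a record has no 001 control field.
        ids.append(record.findtext("marc:controlfield[@tag='001']", None, MARC_NS))
    return ids


# Hand-written sample with made-up identifiers (assumption, not converter output).
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record><marc:controlfield tag="001">example-a</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">example-b</marc:controlfield></marc:record>
</marc:collection>"""

record_ids = list_record_ids(SAMPLE)
print(len(record_ids), record_ids)
```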

More information about this process was presented in July 2023. The presentation can be viewed in full, or the slides can be downloaded from <https://www.loc.gov/bibframe/pdf/LD4-Breaking%20News-Splitting MARC records-20230712.pdf>.

Converter parameters

The converter supports several optional parameters, among them baseuri and idsource, both of which appear in the examples in this section.

Different XSLT processors have different syntaxes for passing parameters. For xsltproc, the syntax is:

xsltproc --stringparam baseuri http://mylibrary.org/ --stringparam idsource http://id.loc.gov/vocabulary/organizations/dlc xsl/marc2bibframe2.xsl test/data/marc.xml
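
When the converter is driven from a script, the same command line can be assembled programmatically. The sketch below is one possible wrapper, not part of this repository. The helper names build_xsltproc_command and convert are hypothetical, and convert assumes that xsltproc is installed and on the PATH.

```python
import subprocess


def build_xsltproc_command(stylesheet, source, params=None):
    """Return the argv list for an xsltproc run with optional string parameters."""
    command = ["xsltproc"]
    for name, value in (params or {}).items():
        # --stringparam passes the value as a plain string, not an XPath expression.
        command += ["--stringparam", name, value]
    command += [stylesheet, source]
    return command


def convert(stylesheet, source, params=None):
    """Run xsltproc and return its output as text (assumes xsltproc is on the PATH)."""
    result = subprocess.run(
        build_xsltproc_command(stylesheet, source, params),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Rebuild the command shown above without executing it.
command = build_xsltproc_command(
    "xsl/marc2bibframe2.xsl",
    "test/data/marc.xml",
    {
        "baseuri": "http://mylibrary.org/",
        "idsource": "http://id.loc.gov/vocabulary/organizations/dlc",
    },
)
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a parameter value contains spaces.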

For Metaproxy integration, the converter parameters can be passed to the stylesheets using the <param> element in the YAZ configuration:

<xslt stylesheet="xsl/marc2bibframe2.xsl">
  <param name="baseuri" value="http://mylibrary.org/"/>
</xslt>

Converter configuration

Some elements of the conversion can be configured using XML files in the xsl/conf directory. This includes, e.g., language mappings for elements generated by 880 tags, and subject thesaurus mappings for MADSRDF elements generated by 6XX tags.

Converter design

The main stylesheet of the XSLT converter application, xsl/marc2bibframe2.xsl, uses push processing to process the fields of each MARC record and build the two main elements it generates, a bf:Work and a bf:Instance. In addition, the fields are pushed through to generate a bflc:adminMetaData property of the bf:Work and to generate bf:hasItem properties of the bf:Instance.

Elements in the resulting RDF/XML document that are not blank nodes or nodes with statically determined URIs are given newly minted URIs constructed from the stem of the baseuri parameter (default http://example.org/), the record ID of the MARC record (by default the value of the 001 field), and a hash URI for the new element. For elements that are not the main bf:Work or bf:Instance element generated by the record, the hash URI is constructed from the element class, the field number, and the position of the field in the MARC record, e.g.:

http://example.org/13600108#Agent100-12
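
The pattern described above for elements other than the main bf:Work and bf:Instance can be sketched as a simple string composition. This is an illustration of the documented URI shape only, not the XSLT implementation; the function name mint_uri is hypothetical.

```python
def mint_uri(baseuri, record_id, element_class, tag, position):
    """Compose a URI from the base URI stem, the record ID, and a hash fragment.

    The fragment joins the element class, the MARC field number, and the
    position of the field in the record, as in "#Agent100-12".
    """
    return f"{baseuri}{record_id}#{element_class}{tag}-{position}"


# Reproduces the example above using the default baseuri.
print(mint_uri("http://example.org/", "13600108", "Agent", "100", 12))
```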

The templates that match the MARC fields are contained in stylesheets included from the main stylesheet, along with some utility templates in the utils.xsl stylesheet and templates for matching control subfields in the ConvSpec-ControlSubfields.xsl stylesheet. Configuration information is read into variables using the document() function.

As much as possible, templates representing each specification document in the specifications are contained in a stylesheet with the same name, for easier maintenance.

Testing

Each of the specification documents in the specifications is represented in a corresponding test suite in the test directory, with test data in the test/data directory.

The tests are written for the XSpec testing framework, a behavior-driven development testing framework for XSLT and XQuery. To run the tests, you must install the Saxon XSLT and XQuery processor as well as XSpec. Installation instructions are available on the XSpec wiki.

Once you have XSpec installed, you can run the entire test suite with the command (for Mac OS or Linux):

xspec.sh test/marc2bibframe2.xspec

Test reports will be output in the test/xspec directory.

Testing for LoC-specific conversion

There are a few conversion behaviors that are specific to the Library of Congress. For example, the Library of Congress uses a locally-defined 859 field as an analogue to the standard 856. To test LoC-specific conversions, run only the ConvSpec-DLC.xspec test suite:

xspec.sh test/ConvSpec-DLC.xspec

Active record conversion

Active conversion of records, that is, resolving URIs for elements of the RDF/XML output from authoritative sources such as the Library of Congress Name Authority File, is achieved through a retrieval tool conversion in the YAZ toolkit.

The retrieval tool in YAZ is driven by an XML configuration, documented in the YAZ User's Guide and Reference. The YAZ conversion for RDF/XML is called rdf-lookup, and a simple configuration looks like this:

<backend syntax="xml" name="rdf-lookup">
  <xslt stylesheet="xsl/marc2bibframe2.xsl"/>
  <rdf-lookup debug="1">
    <namespace prefix="bf" href="http://id.loc.gov/ontologies/bibframe/" />
    <namespace prefix="bflc" href="http://id.loc.gov/ontologies/bflc/"/>
    <lookup xpath="//bf:contribution/bf:Contribution/bf:agent/bf:Agent">
      <key field="bflc:name00MatchKey"/>
      <key field="bflc:name01MatchKey"/>
      <key field="bflc:name11MatchKey"/>
      <server url="http://id.loc.gov/authorities/names/label/%s" method="HEAD"/>
    </lookup>
  </rdf-lookup>
</backend>

From the YAZ User's Guide:

The debug="1" attribute tells the filter to add XML comments to the key nodes that indicate what lookup it tried to do, how it went, and how long it took. The namespace prefix bf: is defined in the namespace tags. These namespaces are used in the xpath expressions in the lookup sections. The lookup tag specifies one tag to be looked up. The xpath attribute defines which node to modify. It may make use of the namespace definitions above. The server tag gives the URL to be used for the lookup. A %s in the string will get replaced by the key value. If there is no server tag, the one from the preceding lookup section is used, and if there is no previous section, the id.loc.gov address is used as a default. The default is to make a GET request, this example uses HEAD.
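
To make the %s substitution concrete, the sketch below builds the request that such a lookup would correspond to, without sending it. It is an approximation written in Python, not YAZ code. The function name build_lookup_request and the sample key are hypothetical, and percent-encoding the key before substitution is an assumption about how the value should be made URL-safe.

```python
from urllib.parse import quote
from urllib.request import Request


def build_lookup_request(template, key, method="HEAD"):
    """Replace %s in the server URL template with the key and build a request object."""
    # Assumption: the key is percent-encoded so that spaces and commas are URL-safe.
    url = template.replace("%s", quote(key, safe=""))
    return Request(url, method=method)


# Illustrative key only; real keys come from the match-key fields in the RDF/XML.
request = build_lookup_request(
    "http://id.loc.gov/authorities/names/label/%s",
    "Twain, Mark, 1835-1910",
)
print(request.get_method(), request.full_url)
```

Constructing the Request object does not open a network connection; the request is sent only if it is passed to urllib.request.urlopen.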

A full sample configuration is available in this directory as record-conv.xml. Using this configuration, you could perform an active conversion of a MARCXML file using the yaz-record-conv utility like so:

yaz-record-conv record-conv.xml test/data/marc.xml

The rdf-lookup conversion support was first introduced in YAZ v5.19.0. YAZ 5.20.0 provided a significant performance improvement for HEAD requests, so using that version or higher is highly recommended.

Metaproxy integration

Both the static and active conversions can be easily integrated into Index Data's Metaproxy metasearch gateway software as a record output format. A sample filter configuration is in the metaproxy directory. With this filter configuration, an SRU request to the server like http://metaproxy.mylibrary.org/?version=1.1&operation=searchRetrieve&query=rec.id%3D13600108&recordSchema=bibframe2&startRecord=1&maximumRecords=1 would retrieve and display the requested record converted into BIBFRAME triples in RDF/XML format. The install-filters.sh script in that directory would deploy the filters into a running Metaproxy configuration.
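
The SRU request above can also be assembled from its individual parameters, which keeps the query properly encoded. The following is a minimal sketch using the Python standard library. The function name sru_record_url is hypothetical, and the host name is the placeholder from the example above.

```python
from urllib.parse import urlencode


def sru_record_url(base, record_id, schema="bibframe2"):
    """Build an SRU searchRetrieve URL that requests a single record by its ID."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f"rec.id={record_id}",  # urlencode turns "=" into %3D
        "recordSchema": schema,
        "startRecord": 1,
        "maximumRecords": 1,
    }
    return f"{base}?{urlencode(params)}"


print(sru_record_url("http://metaproxy.mylibrary.org/", "13600108"))
```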

In addition, we have provided a Vagrantfile and Ansible playbook to build a local Metaproxy VM using VirtualBox for testing, available in the deploy directory.

Known issues

Repository contents

Dependencies

License

As a work of the United States government, this project is in the public domain within the United States.

Additionally, we waive copyright and related rights in the work worldwide through the CC0 1.0 Universal public domain dedication.

Legal Code (read the full text).

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.