lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

Stripping leading/trailing spaces on URI creation #207

Closed RichardWallis closed 2 years ago

RichardWallis commented 3 years ago

Running the latest version (1.6.1) against several thousand MarcXML records revealed a common error: certain subfields with leading or trailing spaces cause invalid URIs to be created. Although the error is in the source data, trimming whitespace in the XSLT would make the conversion more robust in this area.

Example MarcXML:

  <datafield ind1=" " ind2=" " tag="336">
    <subfield code="a">text</subfield>
    <subfield code="2"> rdacontent</subfield>
    <subfield code="0">http://id.loc.gov/vocabulary/contentTypes/txt</subfield>
  </datafield>

Resultant RDFXML:

    <bf:content>
      <bf:Content>
        <rdfs:label>text</rdfs:label>
        <bf:source>
          <bf:Source rdf:about="http://id.loc.gov/vocabulary/genreFormSchemes/ rdacontent"/>
        </bf:source>
      </bf:Content>
    </bf:content>

The result is that downstream RDF processing complains about invalid URIs that contain the preserved leading space. I have also seen examples where a trailing space causes the same symptoms.
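For anyone wanting to spot affected records before they reach downstream tooling, here is a minimal sketch that scans RDF/XML for `rdf:about` and `rdf:resource` values containing whitespace. It uses only the Python standard library; the function name `find_uris_with_whitespace` is hypothetical, and the `rdf:RDF` / `bf:Work` wrapper around the fragment above is an assumption added only so the sample parses on its own.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Attributes whose values are expected to be URIs.
URI_ATTRS = ("{%s}about" % RDF_NS, "{%s}resource" % RDF_NS)


def find_uris_with_whitespace(rdfxml):
    """Return every rdf:about / rdf:resource value that contains whitespace."""
    root = ET.fromstring(rdfxml)
    bad = []
    for element in root.iter():
        for attr in URI_ATTRS:
            value = element.get(attr)
            if value is not None and re.search(r"\s", value):
                bad.append(value)
    return bad


# The fragment from this report, wrapped (as an assumption) so it is well-formed.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:bf="http://id.loc.gov/ontologies/bibframe/">
  <bf:Work>
    <bf:content>
      <bf:Content>
        <rdfs:label>text</rdfs:label>
        <bf:source>
          <bf:Source rdf:about="http://id.loc.gov/vocabulary/genreFormSchemes/ rdacontent"/>
        </bf:source>
      </bf:Content>
    </bf:content>
  </bf:Work>
</rdf:RDF>"""

for uri in find_uris_with_whitespace(sample):
    print(repr(uri))
```

This only flags whitespace; it does not attempt full URI validation.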

This is not confined to the 336 tag; it also affects 337 and 338.

Looking at the XSL, I see that normalize-space(.) is applied to subfield 'b'. Could something similar be applied to subfield '2'?
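As a possible stopgap on the data side, the source records could be cleaned before conversion. The sketch below is not part of marc2bibframe2; it is a hypothetical pre-processing step using the Python standard library. The helper `normalize_space` approximates XPath's `normalize-space()` (trim, then collapse runs of XML whitespace to a single space), and `clean_subfields` is an assumed name. It matches `subfield` elements both with and without the MARC21 slim namespace.

```python
import re
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
SUBFIELD_TAGS = ("subfield", "{%s}subfield" % MARC_NS)


def normalize_space(text):
    """Approximate XPath normalize-space(): collapse XML whitespace and trim."""
    return re.sub(r"[ \t\r\n]+", " ", text or "").strip(" ")


def clean_subfields(root, codes=("2",)):
    """Normalize whitespace in the given subfield codes in place.

    Returns the number of subfields whose text was changed.
    """
    changed = 0
    for element in root.iter():
        if element.tag not in SUBFIELD_TAGS:
            continue
        if element.get("code") in codes and element.text is not None:
            cleaned = normalize_space(element.text)
            if cleaned != element.text:
                element.text = cleaned
                changed += 1
    return changed


# The 336 field from this report.
sample = """<datafield ind1=" " ind2=" " tag="336">
  <subfield code="a">text</subfield>
  <subfield code="2"> rdacontent</subfield>
  <subfield code="0">http://id.loc.gov/vocabulary/contentTypes/txt</subfield>
</datafield>"""

root = ET.fromstring(sample)
clean_subfields(root)
print(ET.tostring(root, encoding="unicode"))
```

Only subfield '2' is touched by default; other codes can be passed via `codes` if they turn out to be affected as well.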

wafschneider commented 2 years ago

@RichardWallis thanks for the report. The issue is corrected on the master branch (and will be in the forthcoming v1.7), and has been backported to the v1.6 branch. I made a new release (v1.6.2) in case that is useful to you, but you should also feel free to use the master branch or the v1.7-RC, which at this time contains most, if not all, of the updates that will be part of v1.7.