lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

Eliminate trailing space on some OCLC numbers #76

Closed kiegel closed 6 years ago

kiegel commented 6 years ago

We find that when OCLC numbers with prefixes in field 035 are converted, they have a trailing space. This does not happen with OCLC numbers without a prefix.

035 __ |a (OCoLC)ocm04212209

bf:identifiedBy [ a bf:Local ;
            bf:source [ a bf:Source ;
                    rdfs:label "OCoLC" ] ;
            rdf:value "ocm04212209 " ],

Please eliminate this space, e.g. by using normalize-space() on the output.

The space causes problems for us when we query OCLC numbers in SPARQL. Typically, BF triples for a converted record contain multiple instances of the OCLC number, often with and without a prefix. This causes duplicate lines in the output, and when we try to use DISTINCT to eliminate dups, this fails. It is easy enough to remove the prefix, which is rightly part of the data string, but the trailing space causes problems. As strings, "04212209" and "04212209 " are not identical and won't de-dup. Since the trailing space is not part of the data, it would be best to remove it during conversion.
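
To make the de-duplication problem concrete, here is a minimal Python sketch, not part of the converter, that compares the values as plain strings, which is roughly how DISTINCT treats literals. The sample list, the helper name `normalize_oclc`, and the "strip leading letters" prefix rule are assumptions made for illustration only.

```python
import re

# Hypothetical sample: one OCLC number as it might appear several times
# in the triples for a single converted record.
values = ["ocm04212209 ", "04212209", "ocm04212209"]


def normalize_oclc(value: str) -> str:
    """Trim surrounding whitespace, then drop a leading alphabetic prefix.

    The prefix rule (any run of leading ASCII letters) is an assumption
    for this example, not a statement about every OCLC prefix in use.
    """
    trimmed = value.strip()
    return re.sub(r"^[A-Za-z]+", "", trimmed)


# Removing only the prefix keeps the trailing space, so two distinct
# strings survive and de-duplication fails.
prefix_only = {re.sub(r"^[A-Za-z]+", "", v) for v in values}

# Trimming as well collapses everything to a single value.
fully_normalized = {normalize_oclc(v) for v in values}

print(sorted(prefix_only))
print(sorted(fully_normalized))
```

The same idea applies on the query side: trimming has to happen before the values are compared, which is why removing the space at conversion time is the simpler fix.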

kirkhess commented 6 years ago

I can't reproduce this, and the chopPunctuation template already removes a trailing space, so that's not the solution. How are you executing the converter? Also, you posted Turtle; are you sure this is in the RDF/XML output?

Note: there's a unit test for 035$a so I changed the data to use your example above (see /test/data/ConvSpec-010-048/marc.xml) and ran the 035 scenario in /test/ConvSpec-010-048.xspec looking for the new value "ocm04212209" (no trailing space) and it passed.

xspec uses Saxon9he.jar. I did notice that other chopPunctuation call-templates have this line:

<xsl:text>:,;/ </xsl:text></xsl:with-param>

while the 035 one relies on the default value of that parameter (which is the same text value). You could open ConvSpec-010-048.xsl, go to line 935, press Enter, paste that in as line 936, and see if it makes a difference.

kiegel commented 6 years ago

I convert using Oxygen with Saxon-PE 9.6.0.7. The trailing space is in the RDF/XML, not an artifact of the conversion to Turtle.

I'm not following how you want me to test. Open a new line after 935 and paste in "<xsl:text>:,;/ </xsl:text></xsl:with-param>"? Or replace line 935 with this line?

kirkhess commented 6 years ago

The first, new line, paste in that with-param. In any case I found it (http://id.loc.gov/tools/bibframe/compare-lccn/full-rdf?find=36010426), we don't have the prefix.

Downloaded, changed the value and no trailing space.

If I manually add a trailing space it doesn't remove it in Oxygen, which is kind of odd. I'll have Wayne check that out.

kirkhess commented 6 years ago

Line 918 in marc2bibframe2/xsl/ConvSpec-010-048.xsl:

<xsl:param name="pChopPunct" select="false()"/>

Change that to true() and it will remove trailing punctuation incl. spaces.
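
For readers who don't want to open the stylesheet, the effect of that switch can be sketched in Python. This is only an emulation of the behaviour described in this thread, not the actual XSLT: the function names are hypothetical, and the punctuation set ":,;/ " is the value quoted earlier in the thread.

```python
def chop_punctuation(text: str, punctuation: str = ":,;/ ") -> str:
    """Strip any run of trailing characters that appear in `punctuation`.

    The default set is the one quoted in this thread; it includes the
    space character, which is what removes the trailing space.
    """
    return text.rstrip(punctuation)


def convert_value(text: str, chop_punct: bool = False) -> str:
    """Emulate a conversion step guarded by a flag like pChopPunct.

    With the flag off the value passes through unchanged; with it on,
    trailing punctuation and spaces are removed.
    """
    return chop_punctuation(text) if chop_punct else text


# Flag off: the trailing space from the 035 value is preserved.
unchanged = convert_value("ocm04212209 ")

# Flag on: the trailing space is chopped.
chopped = convert_value("ocm04212209 ", chop_punct=True)

print(repr(unchanged))
print(repr(chopped))
```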