lcnetdev / marc2bibframe2

Convert MARC records to BIBFRAME2 RDF
http://www.loc.gov/bibframe/
Creative Commons Zero v1.0 Universal

Duplicated ISSNs and series titles in hasSeries statements #47

Closed by osma 7 years ago

osma commented 7 years ago

I'm using the newly released v1.2.0 of marc2bibframe2. When converting this 830 series statement (full record here, with the local identifier 000419615):

    <marc:datafield tag="830" ind1=" " ind2="0">
      <marc:subfield code="a">Braille-delegationens publikationer,</marc:subfield>
      <marc:subfield code="x">1457-6589 ;</marc:subfield>
      <marc:subfield code="v">2.</marc:subfield>
    </marc:datafield>

I get this output:

    <bf:hasSeries>
      <bf:Instance>
        <rdfs:label>Braille-delegationens publikationer</rdfs:label>
        <bf:seriesStatement>Braille-delegationens publikationer, 2.</bf:seriesStatement>
        <bf:seriesEnumeration>2.</bf:seriesEnumeration>
        <bf:identifiedBy>
          <bf:Issn>
            <rdf:value>1457-6589</rdf:value>
          </bf:Issn>
        </bf:identifiedBy>
        <bf:instanceOf>
          <bf:Work rdf:about="http://urn.fi/URN:NBN:fi:bib:me:000419615#Work830-35">
            <rdfs:label>Braille-delegationens publikationer,</rdfs:label>
            <bf:title>
              <bf:Title>
                <bflc:title30MatchKey>Braille-delegationens publikationer,</bflc:title30MatchKey>
                <bflc:title30MarcKey>830 0$aBraille-delegationens publikationer,$x1457-6589 ;$v2.</bflc:title30MarcKey>
                <rdfs:label>Braille-delegationens publikationer,</rdfs:label>
                <bflc:titleSortKey>Braille-delegationens publikationer,</bflc:titleSortKey>
                <bf:mainTitle>Braille-delegationens publikationer</bf:mainTitle>
              </bf:Title>
            </bf:title>
            <bf:identifiedBy>
              <bf:Issn>
                <rdf:value>1457-6589</rdf:value>
              </bf:Issn>
            </bf:identifiedBy>
            <bf:identifiedBy>
              <bf:Issn>
                <rdf:value>1457-6589</rdf:value>
              </bf:Issn>
            </bf:identifiedBy>
          </bf:Work>
        </bf:instanceOf>
      </bf:Instance>
    </bf:hasSeries>

I think the output shows excessive duplication. The ISSN (1457-6589) is repeated three times, once for the series Instance and twice for the series Work. I'm not sure what your spec says, but at least one of the Work ISSNs is completely redundant here.

Based on a very limited understanding of series and ISSNs, I would think that ISSNs would be useful mainly for the Instance. On the other hand, some ISSNs are also ISSN-Ls, which could be useful as work-level identifiers too. You can't tell the two apart from the syntax alone: they look the same, and in many cases an ISSN-L is identical to a regular ISSN. So maybe it makes sense to repeat the ISSN on both the Instance and the Work, but definitely not twice on the same Work.
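For anyone who wants to check their own output for this kind of repetition, here is a minimal Python sketch. It is not part of marc2bibframe2; `duplicate_issns` and `SAMPLE` are hypothetical names, and the sample is a trimmed copy of the series Work from the output above. It only looks at `bf:identifiedBy` elements that are direct children of the resource, so an ISSN that appears once on the Instance and once on the Work is not reported.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace prefixes as used in the converter output.
NS = {
    "bf": "http://id.loc.gov/ontologies/bibframe/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Trimmed copy of the series Work from the output above (titles omitted).
SAMPLE = """<bf:Work xmlns:bf="http://id.loc.gov/ontologies/bibframe/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <bf:identifiedBy><bf:Issn><rdf:value>1457-6589</rdf:value></bf:Issn></bf:identifiedBy>
  <bf:identifiedBy><bf:Issn><rdf:value>1457-6589</rdf:value></bf:Issn></bf:identifiedBy>
</bf:Work>"""


def duplicate_issns(resource):
    """Return the ISSN values that occur more than once directly on `resource`."""
    values = [
        v.text.strip()
        for v in resource.findall("bf:identifiedBy/bf:Issn/rdf:value", NS)
        if v.text
    ]
    return sorted(value for value, count in Counter(values).items() if count > 1)


work = ET.fromstring(SAMPLE)
print(duplicate_issns(work))
```

Running the same function over every `bf:Work` and `bf:Instance` in a converted file would list the resources affected by the repetition.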

There's also quite a lot of duplication in the different ways the title of the series ("Braille-delegationens publikationer") is expressed. In the original MARC record, it appeared just once, but in the BIBFRAME output, it appears a whopping 8 times in various forms (series statements, sort keys etc.). Probably there's some reason behind all of them, but from a data modelling perspective, repeating the same information so many times seems like a bad idea, especially if you're planning on maintaining the data as RDF going forward.

wafschneider commented 7 years ago

Thanks, @osma. You're right, the ISSN repeating in the Work record is definitely an implementation bug. It is deliberate in the spec to place the ISSN on both the Work and the Instance, however (your explanation as to why is better than I could have come up with -- perhaps @kirkhess would like to comment?). Similarly, the many properties containing variations on the title string are as spec'ed -- it probably makes a little more sense with a series entry that has more subfields, as not all of them get used in the same properties, though of course $a is in almost every property. For the updated series processing specs, see http://www.loc.gov/bibframe/mtbf/ConvSpec-Process0-3,6-R1p.docx

osma commented 7 years ago

Wow that was fast! And you even made a new release! Thanks a lot @wafschneider!

osma commented 7 years ago

Regarding the duplication of series titles etc., I think that is bad design, but the problem is with the spec, not the conversion code. Redundant information should be avoided. If there is a need for indexes, sort keys etc. in a particular application, then those should be kept separate from the actual data, probably not even as RDF triples at all but in some auxiliary structure (e.g. a Lucene/Solr/Elasticsearch index), or at least not in the same graph.
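To illustrate what keeping such keys outside the graph could look like, here is a rough Python sketch. Everything in it is an assumption made for illustration: `title_sort_key` is a hypothetical normalisation, not the rule behind `bflc:titleSortKey`, and the plain dict stands in for a real index such as Lucene/Solr/Elasticsearch.

```python
import unicodedata


def title_sort_key(title):
    """Derive a sort key from a title string (hypothetical normalisation)."""
    composed = unicodedata.normalize("NFC", title)
    # Replace punctuation with spaces, then lowercase and collapse whitespace.
    cleaned = "".join(c if c.isalnum() else " " for c in composed)
    return " ".join(cleaned.casefold().split())


# The graph would store the title once; the key lives in a side index
# that can be rebuilt from the graph at any time.
WORK_URI = "http://urn.fi/URN:NBN:fi:bib:me:000419615#Work830-35"
sort_index = {WORK_URI: title_sort_key("Braille-delegationens publikationer,")}

print(sort_index[WORK_URI])
```

Because the key is derived data, changing the normalisation rule only means rebuilding the index, not rewriting triples.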

wafschneider commented 7 years ago

The power of embarrassment :-) -- glad you caught the bug, @osma!