Schema change for NIST-Tech-Pubs repository `allrecords.xml`

ronaldtse commented 6 months ago

As per @kmiller621:

https://github.com/usnistgov/NIST-Tech-Pubs/issues/42

The NIST-Tech-Pubs repository will now be published in MARC21 XML and also MODS.

The MODS schema is easier to work with because it doesn't require us to parse MARC codes, which are numerous and error-prone (from my perspective).

From @andrew2net :

Hi Ronald. The NIST allrecords dataset isn't available anymore: https://raw.githubusercontent.com/usnistgov/NIST-Tech-Pubs/nist-pages/xml/allrecords.xml

This task is to be done with:

https://github.com/relaton/loc-mods/issues/3

andrew2net commented 6 months ago

@ronaldtse where should we download the new data?

ronaldtse commented 6 months ago

It’s on the releases page of that repository. It was mentioned in Katherine’s post 😅

andrew2net commented 6 months ago

@ronaldtse ok, I see the dataset on the release page. But how is it supposed to be updated? Will there be a new release page?

andrew2net commented 6 months ago

@ronaldtse it seems relatedItem[@type="series"] should not go to bibitem/relation, should it?

      ...
      <location>
         <url displayLabel="electronic resource" usage="primary display">https://doi.org/10.6028/NIST.TN.2025r1</url>
      </location>
      <relatedItem type="preceding" otherType="Supersedes" displayLabel="Supersedes">
         <name>
            <namePart>10.6028/NIST.TN.2025</namePart>
         </name>
      </relatedItem>
      <relatedItem type="series">
         <titleInfo>
            <title>NIST technical note; NIST tech note; NIST TN</title>
            <partNumber>2025r1</partNumber>
         </titleInfo>
      </relatedItem>
      ...

ronaldtse commented 6 months ago

> relatedItem[@type="series"] should not go to bibitem/relation

There are pros and cons:

- A series is needed for rendering the citation. So maybe it is a mandatory element and differs from other related items.
  - However, we can also make bibitem/relation mandatory to contain the series.
- The structure of a series item and other related items are identical, so they should not be stored in different places.

It might be best if it is stored there. What do you think? The thinking is that a bibitem can still be used on its own even without any relatedItems.

Ping @opoudjis for thoughts.

andrew2net commented 6 months ago

@ronaldtse I think the relatedItem[@type="series"] elements look like series. Their titles have series names and sometimes series abbreviations. It seems we need to parse the titles and create series, don't we?

ronaldtse commented 6 months ago

@andrew2net I agree. But that means we can keep series in bibitem/relation, right?

opoudjis commented 6 months ago

Excuse me, but are you unaware @ronaldtse that we already have bibitem/series? If you are not unaware, then how is this justifiable semantically? What kind of relation in our existing taxonomy is there between the bibliographic item and the series? The list of possible relations is in https://www.relaton.org/model/relations/

You are not putting up cons. The con is that it is nonsense. A publication is not in a part-whole relation with a series, a publication uses series as a secondary identifier. The "but it is convenient to preserve structure" and "series look like publications" arguments are ugly hackery.

We have in the example above:

 <relatedItem type="series">
         <titleInfo>
            <title>NIST technical note; NIST tech note; NIST TN</title>
            <partNumber>2025r1</partNumber>
         </titleInfo>
      </relatedItem>

This DOES NOT look like

<relatedItem type="preceding" otherType="Supersedes" displayLabel="Supersedes">
         <name>
            <namePart>10.6028/NIST.TN.2025</namePart>
         </name>
      </relatedItem>

And bibitem/series has the structure:

series = element series {
  attribute type { SeriesType }?,
  formattedref?,
  btitle, bplace?, seriesorganization?,
  abbreviation?,
  seriesfrom?, seriesto?,
  seriesnumber?, seriespartnumber?, seriesrun?
}

Which means

relatedItem[@type = 'series']/titleInfo/title => series/title
relatedItem[@type = 'series']/titleInfo/partNumber => series/number

If NIST choose to conflate series and related items, that's their semantic confusion. It's not like a schema which puts three synonyms (!!) inside its titleInfo/title, NIST technical note; NIST tech note; NIST TN, is demonstrating good design. In this case, "NIST TN" = series/abbreviation. I may allow multiple titles for series; this has come up recently already.
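The mapping above can be sketched roughly as follows. This is only an illustration, not the relaton-nist implementation: the function name and the dict keys are made up for the example, the fragment omits the MODS namespace just as the example in this thread does, and treating the last semicolon-separated variant as the abbreviation is an assumption based on the single "NIST TN" sample.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the example in this thread (no namespace, as quoted there).
FRAGMENT = """
<relatedItem type="series">
  <titleInfo>
    <title>NIST technical note; NIST tech note; NIST TN</title>
    <partNumber>2025r1</partNumber>
  </titleInfo>
</relatedItem>
"""


def series_from_related_item(item):
    """Map a relatedItem[@type='series'] element to a series-like dict (hypothetical shape)."""
    title_text = item.findtext("titleInfo/title", default="")
    # The title holds several synonyms separated by semicolons.
    variants = [v.strip() for v in title_text.split(";") if v.strip()]
    series = {"title": variants[0] if variants else None}
    # Assumption: when there are several variants, the last one is the abbreviation.
    if len(variants) > 1:
        series["abbreviation"] = variants[-1]
    number = item.findtext("titleInfo/partNumber")
    if number:
        series["number"] = number.strip()
    return series


series = series_from_related_item(ET.fromstring(FRAGMENT))
print(series)
```

The middle variant ("NIST tech note") is dropped here; if series end up allowing multiple titles, it would have somewhere to go.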

But I do not see why bibitem/relation makes any sense at all for series. If we accept that series are bibliographic items, which is questionable, we would need to make it bibitem/relation[@type = 'includedIn'][description = 'series']; but we already HAVE bibitem/series. Going out of our way to avoid the correct semantic element is obfuscation, and it will lead to bad rendering: relaton-render will have to deal with this absurdity, and it should not have to look for series under bibitem/relation when it expects to find them in bibitem/series.

Your argument for using bibitem/relation is... I don't even know what it is. Laziness, wanting to recycle bibitem/relation/bibitem? But the savings in recycling it are offset by the confusion it causes anyone actually using it.

Also:

> A series is needed for rendering the citation. So maybe it is a mandatory element and differs from other related items. - However, we can also make bibitem/relation mandatory to contain the series.

Not in my grammar you won't. NIST Tech Pubs repository is not the only or even the primary user of relaton grammars. And you can't make bibitem/relation mandatory for one type. If you're going to make things mandatory in a schema profile specific to NIST Tech Pubs, you want a discrete element like series, not a catchall element like relation.

> The structure of a series item and other related items are identical, so they should not be stored in different places.

That is an argument for doing away with series as an element in general, which I reject, and which you yourself Ronald did not present in ISO 690.

Step away from this madness. If you insist on doing this, I want nothing to do with it, but I will expect that any insane nonsensical lazy-arsed unmotivated encoding of bibitem/relation[@type = 'includedIn'][description = 'series'] is shadowed by the semantically correct bibitem/series.

andrew2net commented 6 months ago

@ronaldtse Most name elements have the personal or corporate type. They become person or organization contributors respectively. But a few name elements have the conference type. How should we handle them?

      ...
      <name type="conference">
         <namePart>PerMIS Workshop (2012 : Gaithersburg, MD)</namePart>
      </name>
      <name type="personal">
         <namePart>Marvel, Jeremy.</namePart>
      </name>
      <name xmlns:xlink="http://www.w3.org/1999/xlink"
            type="corporate"
            xlink:href="https://id.loc.gov/authorities/names/n88112126">
         <namePart>National Institute of Standards and Technology (U.S.)</namePart>
         <nameIdentifier>https://id.loc.gov/authorities/names/n88112126</nameIdentifier>
      </name>
      ...

opoudjis commented 6 months ago

Yuck. Conferences are not authors, they are occasions on which work is presented and gets feedback. But if they insist on modelling them this way, they are organisations: a conference authoring a paper is an ad hoc kind of group of people brought together to do a task, which is pretty much an organisation.

If you need to differentiate between corporations and conferences, we don't have organisation types right now, and I'm not convinced we'd need them, but I'd refine this as /contributor/role[@type = 'author'][description = 'conference']. I would also argue that the conference is properly the authoriser of the work, /contributor/role[@type = 'authorizer'], but that's more complicated than this deserves.
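A minimal sketch of that classification, assuming conferences are modelled as organisations with a role description of "conference". The `ENTITY_BY_TYPE` table, the output dict shape, the `<mods>` wrapper element, and the blanket `author` role are all assumptions for illustration; the fragment is trimmed from the example above and leaves out namespaces and xlink attributes.

```python
import xml.etree.ElementTree as ET

# Trimmed from the example in this thread; the <mods> wrapper is added only
# so that the fragment is well-formed.
FRAGMENT = """
<mods>
  <name type="conference">
    <namePart>PerMIS Workshop (2012 : Gaithersburg, MD)</namePart>
  </name>
  <name type="personal">
    <namePart>Marvel, Jeremy.</namePart>
  </name>
  <name type="corporate">
    <namePart>National Institute of Standards and Technology (U.S.)</namePart>
  </name>
</mods>
"""

# MODS name type -> (entity kind, role description). Hypothetical mapping.
ENTITY_BY_TYPE = {
    "personal": ("person", None),
    "corporate": ("organization", None),
    "conference": ("organization", "conference"),
}


def contributors(root):
    """Turn MODS name elements into contributor dicts (hypothetical shape)."""
    result = []
    for name in root.findall("name"):
        kind, description = ENTITY_BY_TYPE.get(name.get("type"), (None, None))
        if kind is None:
            continue  # unknown name type: skip rather than guess
        result.append({
            "entity": kind,
            "name": name.findtext("namePart", default="").strip(),
            # Assumption: every name is treated as an author.
            "role": {"type": "author", "description": description},
        })
    return result


contribs = contributors(ET.fromstring(FRAGMENT))
for c in contribs:
    print(c)
```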

ronaldtse commented 6 months ago

Wait wait. They are "venues"? I don't think they are "contributors"...

opoudjis commented 6 months ago

I assumed they were being put forward as authors, and that was why the choice of person vs organisation was needed.

If they are conference venues... crap. We haven't modelled them in relaton yet for proceedings and conference papers, though of course the modelling is there for ISO 690. We would need to track place + date + conference name and maybe series; there's no easy container in relaton for all of that right now. (Series is the closest, but not close enough.)

Relaton does not cover all of ISO 690, though it does cover a lot of it; this is probably the biggest gap germane to the kinds of bibliographic items we are likely to deal with.

Do we need to do this for this database? If so, I am pretty sure this is not something @andrew2net wants to tackle immediately.

ronaldtse commented 6 months ago

@andrew2net can we ignore the "conference" issues for now? Let's make that a new ticket and not let this block the migration. Thanks.

andrew2net commented 6 months ago

@ronaldtse the MODS dataset has 193 fewer docs than allrecords.xml. Also, some docs exist only in the MODS dataset, so more than 193 docs are missing. In addition, there are duplicates in the MODS dataset. The IDs are fetched from location/url:

...
<location>
    <url displayLabel="electronic resource" usage="primary display">https://doi.org/10.6028/NBS.LCIRC.108</url>
</location>
...

so there are two 10.6028/NBS.LCIRC.108, two 10.6028/NBS.SP.535v2m-z, and many other duplicates. Should we release the new dataset to production, or do we need to resolve these issues first?
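For reference, a check like this can be sketched as below. It is not the code used to produce the counts in this thread: the `SAMPLE` collection is made up from URLs quoted here, the `modsCollection`/`mods` element names are used without a namespace for brevity, and `record_id` and `duplicate_ids` are hypothetical helper names.

```python
from collections import Counter
import xml.etree.ElementTree as ET

DOI_PREFIX = "https://doi.org/"

# Hypothetical three-record collection built from URLs quoted in this thread.
SAMPLE = """
<modsCollection>
  <mods><location><url usage="primary display">https://doi.org/10.6028/NBS.LCIRC.108</url></location></mods>
  <mods><location><url usage="primary display">https://doi.org/10.6028/NBS.LCIRC.108</url></location></mods>
  <mods><location><url usage="primary display">https://doi.org/10.6028/NIST.TN.2025r1</url></location></mods>
</modsCollection>
"""


def record_id(mods):
    """Derive the document ID from location/url by dropping the DOI resolver prefix."""
    url = mods.findtext("location/url", default="").strip()
    return url[len(DOI_PREFIX):] if url.startswith(DOI_PREFIX) else url


def duplicate_ids(root):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(record_id(m) for m in root.findall("mods"))
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


dups = duplicate_ids(ET.fromstring(SAMPLE))
print(dups)
```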

ronaldtse commented 6 months ago

> the MODS dataset has 193 fewer docs than allrecords.xml. Also, some docs exist only in the MODS dataset, so more than 193 docs are missing.

So either database has records missing? MODS contains something that the old allrecords.xml doesn't have?

Duplicates

I checked 10.6028/NBS.LCIRC.108: the two entries were likely encoded by different people on different dates, and they differ slightly in completeness.

One:

      <name type="personal" usage="primary">
         <namePart>Miller, D. R.</namePart>
      </name>
      <name type="personal">
         <namePart>Miller, D. R.</namePart>
      </name>
      <name type="personal">
         <namePart>Fullmer, I. H.</namePart>
      </name>
...
      <note type="statement of responsibility">D.R. Miller ; I.H. Fullmer.</note>
      <note>Title from PDF title page (viewed September 27, 2018).</note>
      <identifier type="oclc">1054788303</identifier>

Two:

      <name type="personal" usage="primary">
         <namePart>"Miller, D. R."</namePart>
      </name>
      <name type="personal">
         <namePart>"Miller, D. R."</namePart>
      </name>
...
      <note type="statement of responsibility">D.R. Miller ; I.H. Fullmer.</note>
      <note>"Title from PDF title page (viewed June 7, 2017)."</note>
      <identifier type="oclc">994612599</identifier>

Clearly the "newer" entry (the one with the later ID) is correct, because the older entry is missing one of the authors' names.

So for duplicates, let's use the newer entry.
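One way to express that rule, as a sketch only: it assumes "later ID" means the larger OCLC number and that every record carries one, and the record dicts and `keep_newest` name are hypothetical.

```python
def keep_newest(records):
    """For each document ID keep the record with the highest OCLC number.

    `records` is an iterable of dicts with "id" and "oclc" keys (hypothetical shape).
    """
    newest = {}
    for rec in records:
        current = newest.get(rec["id"])
        # Compare as integers: as strings, "994612599" would sort after "1054788303".
        if current is None or int(rec["oclc"]) > int(current["oclc"]):
            newest[rec["id"]] = rec
    return list(newest.values())


# The two 10.6028/NBS.LCIRC.108 entries quoted above, reduced to the relevant fields.
records = [
    {"id": "10.6028/NBS.LCIRC.108", "oclc": "994612599",
     "authors": ["Miller, D. R."]},
    {"id": "10.6028/NBS.LCIRC.108", "oclc": "1054788303",
     "authors": ["Miller, D. R.", "Fullmer, I. H."]},
]

deduplicated = keep_newest(records)
print(deduplicated)
```

Records without an OCLC identifier would need a separate tie-breaker, which this sketch does not attempt.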

andrew2net commented 6 months ago

> So either database has records missing? MODS contains something that the old allrecords.xml doesn't have?

Here is the list of differences: diff.txt

Lines that are only in the old dataset are prefixed with <.
Lines that are only in the new dataset are prefixed with >.
If a file name has changed, it will be shown as the file from the old dataset and then the file from the new dataset.
Each change is preceded by the line numbers that apply to that change. The format is start,end for ranges of lines, or just a single number for single lines. The line numbers for the first file and the second file are separated by c (for changed), a (for added), or d (for deleted).
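To make the change-header format concrete, here is a small sketch that parses headers of the form described above. The sample headers in the usage lines are made up, not taken from diff.txt, and `parse_change` is a hypothetical helper name.

```python
import re

# start[,end] + one of a/c/d + start[,end], e.g. "12,14c12,13" or "5a6,7"
CHANGE_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")
KINDS = {"a": "added", "c": "changed", "d": "deleted"}


def parse_change(line):
    """Parse a change header; return None for any other line (e.g. "< ..." or "> ...")."""
    m = CHANGE_RE.match(line.strip())
    if m is None:
        return None
    old_start, old_end, op, new_start, new_end = m.groups()
    return {
        "kind": KINDS[op],
        # A single number stands for a one-line range.
        "old": (int(old_start), int(old_end or old_start)),
        "new": (int(new_start), int(new_end or new_start)),
    }


print(parse_change("12,14c12,13"))
print(parse_change("5a6,7"))
print(parse_change("< some line only in the old dataset"))
```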

andrew2net commented 4 months ago

@ronaldtse Also, the new dataset has 417 duplicates. In each case, the URLs of the duplicated documents are identical.

Here are the URLs of duplicated docs: log.txt

Should we move forward with all these issues or wait until NIST fixes them?

ronaldtse commented 4 months ago

@andrew2net we will need to move forward first, and then let's file issues at the NIST repository on what went missing.

andrew2net commented 4 months ago

New schema parser implemented in v1.19.1

relaton / relaton-nist

Schema change for NIST-Tech-Pubs repository `allrecords.xml` #112