ietf-tools / bibxml-service

Django-based Web service implementing IETF BibXML APIs
https://bib.ietf.org
BSD 3-Clause "New" or "Revised" License

DOI references not matching RFC 2629 DTD #228

Open ronaldtse opened 2 years ago

ronaldtse commented 2 years ago

From @kesara https://github.com/ietf-tools/xml2rfc/pull/804#issuecomment-1175684226

Tests are failing because reference.DOI.10.1145/2975159 doesn't have date element under front element. This violates rfc2629.dtd.

Originally posted by @kesara in https://github.com/ietf-tools/xml2rfc/issues/804#issuecomment-1175684226

ronaldtse commented 2 years ago

@stefanomunarini can we add tests to validate bibitems (selection of tests across all datasets) against the BibXML schema?
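A minimal sketch of what such a test could check, assuming the serialized item is a single `<reference>` document. This is a structural spot check written with the standard library only, not real DTD validation; the function name `check_reference`, the sample anchor, and the placeholder author are illustrative and not taken from bibxml-service.

```python
import xml.etree.ElementTree as ET


def check_reference(xml_text, require_date=True):
    """Return a list of structural problems found in one <reference> document."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "reference":
        problems.append(f"root element is <{root.tag}>, expected <reference>")
        return problems
    front = root.find("front")
    if front is None:
        problems.append("missing <front>")
        return problems
    if front.find("title") is None:
        problems.append("missing <front>/<title>")
    # rfc2629.dtd lists <date> among the required children of <front>.
    if require_date and front.find("date") is None:
        problems.append("missing <front>/<date>")
    return problems


# Hypothetical sample shaped like the failing item (placeholder author, no <date>).
SAMPLE = """<reference anchor="DOI.10.1145/2975159">
  <front>
    <title>Jupiter rising</title>
    <author fullname="Placeholder Author"/>
  </front>
</reference>"""

print(check_reference(SAMPLE))  # ['missing <front>/<date>']
```

The `require_date` flag is there so the same check can be run in a lenient mode for sources that legitimately have no date.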

kesara commented 2 years ago

bibxml-service reference: https://bib.ietf.org/public/rfc/bibxml-doi/reference.DOI.10.1145/2975159.xml

current tools.ietf.org reference: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-doi/reference.DOI.10.1145/2975159.xml

New bibxml-service output's title is incomplete. Also it lacks <seriesInfo name="Communications of the ACM" value="Vol. 59, pp. 88-97"/>.

ronaldtse commented 2 years ago

Incomplete title

Original: "Jupiter rising: a decade of clos topologies and centralized control in Google's datacenter network"

New: "Jupiter rising"

Right, this needs to be fixed (@strogonoff ). I think it may be fixed by #215 (@stefanomunarini ).

<seriesInfo name="Communications of the ACM" value="Vol. 59, pp. 88-97"/>.

@kesara while this could be useful, in <seriesInfo>, the "name" attribute value is explicitly invalid according to RFC 7991:

2.47.3.  "name" Attribute (Mandatory)

   The name of the series.  The currently known values are "RFC",
   "Internet-Draft", and "DOI".  The RFC Series Editor may change this
   list in the future.

strogonoff commented 2 years ago

Can anyone point to the preexisting Crossref API handler in IETF’s xml2rfc tools (i.e., the code that runs under /public/rfc/bibxml-doi/)? https://github.com/ietf-tools/xml2rfc-bibxml doesn’t seem to have it 🤔

rjsparks commented 2 years ago

What you're looking for is in the RFP, in the section for bibxml7.

ronaldtse commented 2 years ago

@strogonoff the bibxml-doi code is here: https://github.com/ietf-tools/xml2rfc-website/tree/56c0be788c4fd22ae475302dcd399439815927f0/public/rfc/bibxml-doi

It uses doilit, which we have already reimplemented: https://github.com/ietf-tools/xml2rfc-website/blob/56c0be788c4fd22ae475302dcd399439815927f0/public/rfc/bibxml-doi/nph-index.cgi#L189

ronaldtse commented 2 years ago

I'm a little perplexed: our doi2ietf already implements dates but why is it not serialised into BibXML?

Yes we need to adopt the dates from the Crossref API and map them to the Relaton model.

Relaton supports these date/time types:

Crossref metadata includes the following date/times:

  • <title>: it looks like IETF xml2rfc tools concatenated title and subtitle using a colon. Relaton-py could do that if that’s reliable. Currently, relaton-py’s bibxml serializer doesn’t do any such title adaptation and ends up using the first available title when serializing to BibXML. We can either change BibXML serialization in relaton-py, or change the way we format the main title when parsing Crossref data in bibxml-service.

We should concatenate the Crossref title and subtitle at the doi2ietf level.
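A minimal sketch of that concatenation, assuming the Crossref work record carries `title` and `subtitle` as lists of strings and that this item's title is split across the two fields. The helper name `full_title` is hypothetical and is not the doi2ietf API.

```python
def full_title(work):
    """Join the first Crossref title and subtitle with a colon when both exist."""
    titles = work.get("title") or []
    subtitles = work.get("subtitle") or []
    title = titles[0].strip() if titles else ""
    subtitle = subtitles[0].strip() if subtitles else ""
    if title and subtitle:
        return f"{title}: {subtitle}"
    # Fall back to whichever part is present (possibly an empty string).
    return title or subtitle


# Assumed shape of the record for the item discussed in this issue.
work = {
    "title": ["Jupiter rising"],
    "subtitle": [
        "a decade of clos topologies and centralized control "
        "in Google's datacenter network"
    ],
}
print(full_title(work))
```

Records without a subtitle keep their title unchanged, so the same function can be applied to every item.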

  • “Communications of the ACM” is apparently taken from container-title, we could use that when creating a bibliographic item from Crossref data if that’s always how it should be parsed.

As I pointed out in https://github.com/ietf-ribose/bibxml-service/issues/228#issuecomment-1175771552 , we really want explicit permission from @rjsparks that this is correct usage of <seriesInfo>. Thanks.

strogonoff commented 2 years ago

@ronaldtse

our doi2ietf already implements dates but why is it not serialised into BibXML?

We are not using doi2ietf for at least these two reasons:

  1. doi2ietf-py used an obsolete, unsupported Crossref API to retrieve data.
  2. doi2ietf-py’s purpose was to transform Crossref API data directly into BibXML, not to Relaton. We don’t need that; we need to transform Crossref to Relaton, and use relaton-py’s bibxml serializer after that.

With that in mind, it was faster to bypass doi2ietf-py and implement this directly in bibxml-service and relaton-py.

Yes we need to adopt the dates from the Crossref API and map them to the Relaton model.

Yes, @stefanomunarini’s PRs should take care of all that. They aim to port the requisite functionality from doi2ietf-py into both the bibxml-service Crossref DOI parser and the relaton-py serializer. I’ll merge them once we confirm that new <seriesInfo name> values are acceptable, because the PRs include that change as well.

rjsparks commented 2 years ago

@ronaldtse It is expected that seriesinfo will have more than the 3 possible names listed in 7991. We will make sure that gets clarified in 7991bis. A better thing to read at the moment is the seriesInfo entry at https://authors.ietf.org/en/rfcxml-vocabulary

ajeanmahoney commented 2 years ago

Note that the RPC uses seriesInfo for documents that are part of a series and have a unique value. Examples of document series include RFC, IEEE Std, ITU Recommendation, DOI, 3GPP TR, 3GPP TS, ISO/IEC, and FIPS PUB. The RPC uses refcontent to capture journal or conference proceedings information: journal or conference title, volumes, pages, conference location, etc. For example,

<refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>

ronaldtse commented 2 years ago

Thanks @rjsparks @ajeanmahoney .

Valid values for <seriesInfo> "name"

Is the seriesInfo value from a controlled vocabulary or free form text? If the former, it would be great to have the specifications.

https://authors.ietf.org/en/rfcxml-vocabulary seems to describe the "name" attribute as the name of the standardization organization outside of IETF ("other names such as "ISO", "W3C" exist for other standardisation organisations")

Is "name" supposed to take the "series name" or the "organization name"?

From the illustrative list provided it looks like it is the "series name" (which makes sense given the element name), not the "organization name".

Some questions regarding the example list:

  1. I understand the separation of "3GPP TS" and "3GPP TR". Are "ISO/IEC", "ISO/IEC TR" and "ISO/IEC TS" also separated (and there are other deliverable types as well)?
  2. IEEE offers other deliverable types that are not standards, such as "Recommended Practices" and "Guidelines". Should they be considered series?
  3. "ITU Recommendations" are published as "ITU-T Recommendations" and "ITU-R Recommendations". They also have a dozen deliverable types. Are they supported?
  4. NIST and W3C are also supported by bibxml-service.

Proper structuring of a DOI entry in BibXML

The item in question has source metadata provided through this Crossref link:

Notice that "Communications of the ACM" exists in container-title.

As specified by @ajeanmahoney , this information is to be in <refcontent>, not <seriesInfo>, and should look like this:

<refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>

This formatted reference string can only be built from the raw Crossref metadata, by also including these elements:

"page":"88-97",
"volume":"59"

I would like to confirm with @ajeanmahoney that:

  1. Every BibXML item generated from DOI will be using refcontent, not seriesInfo.
  2. We will programmatically construct the formatted reference string in refcontent using Crossref metadata. This is about citation rendering.
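A rough sketch of how the formatted string could be assembled, assuming Crossref returns `container-title` as a list and `volume` and `page` as strings. The function name `refcontent_from_crossref` and the fixed "Vol." and "pp." labels are assumptions for illustration, not confirmed RPC formatting rules.

```python
import xml.etree.ElementTree as ET


def refcontent_from_crossref(work):
    """Build a <refcontent> element from Crossref fields, or None if none are present."""
    parts = []
    container_titles = work.get("container-title") or []
    if container_titles:
        parts.append(container_titles[0])
    if work.get("volume"):
        parts.append(f"Vol. {work['volume']}")
    if work.get("page"):
        parts.append(f"pp. {work['page']}")
    if not parts:
        return None
    element = ET.Element("refcontent")
    element.text = ", ".join(parts)
    return element


work = {
    "container-title": ["Communications of the ACM"],
    "volume": "59",
    "page": "88-97",
}
print(ET.tostring(refcontent_from_crossref(work), encoding="unicode"))
# <refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>
```

Missing fields are skipped rather than rendered empty, so an item with only a container title still yields a usable string.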

Thanks!

ajeanmahoney commented 2 years ago

seriesInfo name and value attributes take freeform text. The name attribute holds the name of the series. The RPC uses the following seriesInfo names:

These are what we have identified so far. We will be discussing this list this week.

ronaldtse commented 2 years ago

Thanks @ajeanmahoney , since there's going to be a discussion if you don't mind let us provide some additional input 😉

Basis:

Questions:

  1. I believe seriesInfo name should support all series that BibXML service supports today (as part of the ietf-tools suite), including those published by the following organizations:
    • 3GPP
    • IEEE
    • NIST
    • W3C
    • IANA
  2. Consider whether seriesInfo name (for organizations external to IAB/IETF) represents the name of the SDO, or a document type of the SDO. Developers and users would certainly prefer a consistent application. Amongst values supported today:
    • document types: 3GPP TS, 3GPP TR, ITU Recommendation, IEEE Std
    • organization name: ISO/IEC
    • series name: FIPS PUB (published by the Department of Commerce as executed by NIST)

Thanks!

strogonoff commented 2 years ago

Tests are failing because reference.DOI.10.1145/2975159 doesn't have date element under front element. This violates rfc2629.dtd.

Originally posted by @kesara in ietf-tools/xml2rfc#804 (comment)

Can I clarify where <date> is required? It’s not in this spec.

@ronaldtse While this particular issue may have been resolved, since we can rely on DOI to provide at least one date, we cannot be so sure with some other sources.

For example, we have recently found that some 3GPP documents are lacking dates, and this may be the case with other sources.

ajeanmahoney commented 2 years ago

There are some cases where a date is never provided in a bib entry (IANA registry entries, for instance).

Sometimes, an author points to a landing page for a spec (a 3GPP or IEEE entry may fall into this category). Those kinds of entries don't have dates. I haven't looked to see if the bibxml-service datastore contains landing-page references.

rjsparks commented 2 years ago

References without dates are syntactically legal and appropriate in cases like Jean calls out above. But when the document does have a publication date (as the original DOI the ticket was opened with), the date must be provided, well formed, in the reference.

rjsparks commented 2 years ago

I think I've pointed this out in other places, but rfc2629.dtd is not v3 rfcxml - it is strict v2, and while we want to be v2 backwards compatible as much as we can be, there are many RFCs in the v2 era that were published with references that didn't contain dates. In short, date cannot be treated as a mandatory element here.
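One way to read the two points above is that a serializer should emit `<date>` whenever the source provides one and omit it otherwise. A minimal sketch of that behaviour follows, assuming the date arrives as a Crossref-style list of `[year, month, day]` parts. The helper names and the sample values are hypothetical, and this is not relaton-py's serializer.

```python
import xml.etree.ElementTree as ET

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_element(date_parts):
    """Build a <date> element from [year, month, day] parts, or None without a year."""
    if not date_parts or not date_parts[0]:
        return None
    year, *rest = date_parts
    element = ET.Element("date", year=str(year))
    if rest and 1 <= rest[0] <= 12:
        element.set("month", MONTHS[rest[0] - 1])
    if len(rest) > 1:
        element.set("day", str(rest[1]))
    return element


def front_element(title, date_parts=None):
    """Build <front> with a title, adding <date> only when a date is known."""
    front = ET.Element("front")
    ET.SubElement(front, "title").text = title
    date = date_element(date_parts)
    if date is not None:
        front.append(date)
    return front


# Illustrative values only; not taken from the Crossref record.
dated = front_element("Example with a date", [2016, 8])
undated = front_element("Example landing page")
print(dated.find("date").attrib)
print(undated.find("date"))
```

Tests for dated sources such as DOI can then assert that `<date>` is present, while sources such as IANA registry entries are allowed to omit it.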