relaton / relaton-ietf

RFCBib: retrieve RFC Standards for bibliographic use using the BibliographicItem model
BSD 2-Clause "Simplified" License

`formatted_initials` without content #106

Closed strogonoff closed 1 year ago

strogonoff commented 2 years ago

So far noticed only in draft-andersen-arc-01.yaml.

ronaldtse commented 2 years ago

There's not even a name inside this block:

    name:
      given:
        formatted_initials:
          language:
          - en
      surname:
        content: None
        language:
        - en
      completename:
        content: None
        language:
        - en

ronaldtse commented 2 years ago

The root cause is this:

In the original RFC XML, there is an "author" that is an organization, not a person: https://www.ietf.org/archive/id/draft-andersen-arc-01.xml

  <front>
    <title abbrev="ARC">Authenticated Received Chain (ARC)</title>

    <author initials="." surname="OAR-DEV Group">
      <organization>OAR-DEV Group</organization>
      <address>
        <email>arc-discuss@dmarc.org</email>
      </address>
    </author>
    <author initials="K." surname="Andersen" fullname="Kurt Andersen">
      <organization>LinkedIn</organization>
      <address>
        <postal>
          <street>2029 Stierlin Ct.</street>
          <city>Mountain View</city>
          <region>California</region>
          <code>94043</code>
          <country>USA</country>
        </postal>
        <email>kurta@linkedin.com</email>
      </address>
    </author>

There is no way for us to tell this is not a person, because it has initials and surname.

May I propose the following as the correct outcome?

    name:
      given:
        formatted_initials:
          content: .
          language:
          - en
      surname:
        content: OAR-DEV Group
        language:
        - en

Adjustments welcome.
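A minimal sketch of that mapping, assuming Python and the standard-library `xml.etree.ElementTree`. This is not relaton-ietf's actual code; `localized` and `author_to_name` are hypothetical helpers named here for illustration only. It copies `initials` and `surname` exactly as they appear in the source XML and skips missing or empty attributes, so `formatted_initials` is never emitted without `content`:

```python
import xml.etree.ElementTree as ET

# Excerpt of the first <author> element from draft-andersen-arc-01.xml.
AUTHOR_XML = """
<author initials="." surname="OAR-DEV Group">
  <organization>OAR-DEV Group</organization>
</author>
"""


def localized(content, language="en"):
    """Build a localized-string mapping shaped like the YAML above."""
    return {"content": content, "language": [language]}


def author_to_name(author):
    """Map an <author> element to a name mapping (hypothetical helper).

    Attributes that are missing or empty are skipped, so no key is
    emitted without content.
    """
    name = {}
    initials = author.get("initials")
    if initials:
        name["given"] = {"formatted_initials": localized(initials)}
    surname = author.get("surname")
    if surname:
        name["surname"] = localized(surname)
    fullname = author.get("fullname")
    if fullname:
        name["completename"] = localized(fullname)
    return name


name = author_to_name(ET.fromstring(AUTHOR_XML.strip()))
print(name)
```

For the element above this yields `given.formatted_initials` with the original `.` plus `surname`, and no `completename`, because the source has no `fullname` attribute.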

strogonoff commented 2 years ago

> There's not even a name inside this block:
>
>     name:
>       given:
>         formatted_initials:
>           language:
>           - en
>       surname:
>         content: None
>         language:
>         - en
>       completename:
>         content: None
>         language:
>         - en

@ronaldtse I deliberately raised this issue about the schema only. A schema mismatch may break data loaders/deserializers. It also raises the question of how this passed through the serialization mechanism without failures: apparently some code doesn’t validate that formatted_initials is a valid formatted string?

The issue with the data is separate from the schema issue: it doesn’t break anything and can be fixed at any time…
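To illustrate the kind of check that seems to be missing, here is a rough sketch in Python. It assumes that a "formatted string" is any mapping carrying a `language` key, as in the YAML excerpts in this thread; `find_empty_localized_strings` is a hypothetical name, not an existing relaton function. It walks the data and reports every such mapping that lacks `content`:

```python
def find_empty_localized_strings(node, path=""):
    """Return paths of mappings that have `language` but no usable `content`.

    Assumption: a localized string is any mapping with a `language` key.
    """
    problems = []
    if isinstance(node, dict):
        if "language" in node and not node.get("content"):
            problems.append(path or "<root>")
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            problems.extend(find_empty_localized_strings(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(find_empty_localized_strings(item, f"{path}[{index}]"))
    return problems


# The broken name block from this issue, written as plain Python data.
# Note that `content: None` in YAML is the string "None", not a null.
broken = {
    "name": {
        "given": {"formatted_initials": {"language": ["en"]}},
        "surname": {"content": "None", "language": ["en"]},
        "completename": {"content": "None", "language": ["en"]},
    }
}

problems = find_empty_localized_strings(broken)
print(problems)  # ['name.given.formatted_initials']
```

Only the schema problem is flagged here. The literal `None` strings count as content, which matches the point that the data problem is a separate matter.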

But if we are talking about data, why do you include the given name and formatted_initials in your output, since it seems you only put a full stop there as a placeholder? I think it can be omitted, leaving just surname and completename.

      given:
        formatted_initials:
          content: .
          language:
          - en

ronaldtse commented 2 years ago

> why do you include given name and formatted_initials in your output since it seems like you only put there a full stop as a placeholder? I think it can be omitted and we could just have surname and completename left.

@strogonoff could you re-read my original message? There is some misunderstanding here. This question makes no sense. The full stop is from the original data.

andrew2net commented 2 years ago

@ronaldtse we use the command `rsync -avcizxL rsync.ietf.org::bibxml-ids ./bibxml-ids` to get the source files. The content of the `reference.I-D.draft-andersen-arc-01.xml` source file is:

<?xml version="1.0" encoding="UTF-8"?>
<reference anchor="I-D.andersen-arc">
   <front>
      <title>Authenticated Received Chain (ARC)</title>
      <author initials="" surname="None" fullname="None">
         </author>
      <author initials="K." surname="Andersen" fullname="Kurt Andersen">
         </author>
      <author initials="J." surname="Rae-Grant" fullname="John Rae-Grant">
         </author>
      <author initials="B." surname="Long" fullname="Brandon Long">
         </author>
      <author initials="J. T." surname="Adams" fullname="J. Trent Adams">
         </author>
      <author initials="S. M." surname="Jones" fullname="Steven M Jones">
         </author>
      <date month="February" day="1" year="2016" />
      <abstract>
     <t>   Authenticated Received Chain (ARC) permits an organization which is
   creating or handling email to indicate their involvement with the
   handling process by adding a cryptographically signed header (or
   headers) in a manner analagous to that of DomainKeys Identified Mail
   (DKIM).  Assertion of responsibility is validated through a
   cryptographic signature and by querying the Signer&#39;s domain directly
   to retrieve the appropriate public key.  Changes in the message which
   may break DKIM, may be tracked through the ARC set of headers.

     </t>
      </abstract>
   </front>
   <seriesInfo name="Internet-Draft" value="draft-andersen-arc-01" />
   <format type="TXT" target="https://www.ietf.org/archive/id/draft-andersen-arc-01.txt" />
</reference>

andrew2net commented 1 year ago

fixed https://github.com/ietf-tools/relaton-data-ids/blob/cd542f8c3de75ef75a83efd3ea93a6558bd24bc9/data/draft-andersen-arc-01.yaml#L27
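For reference, placeholder entries like the `surname="None"` author in the bibxml-ids file above could be flagged with a check along these lines. This is only a sketch in Python using the standard library; the heuristic in `is_placeholder_author` is an assumption for illustration and is not necessarily what the linked fix does:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bibxml-ids reference quoted above.
REFERENCE_XML = """<reference anchor="I-D.andersen-arc">
   <front>
      <title>Authenticated Received Chain (ARC)</title>
      <author initials="" surname="None" fullname="None">
         </author>
      <author initials="K." surname="Andersen" fullname="Kurt Andersen">
         </author>
   </front>
</reference>"""


def is_placeholder_author(author):
    """Heuristic (an assumption): empty initials plus a literal "None"
    for both surname and fullname marks a placeholder entry."""
    return (
        not author.get("initials")
        and author.get("surname") == "None"
        and author.get("fullname") == "None"
    )


root = ET.fromstring(REFERENCE_XML)
authors = root.findall("./front/author")
kept = [a.get("fullname") for a in authors if not is_placeholder_author(a)]
print(kept)  # ['Kurt Andersen']
```

Requiring all three conditions keeps the check narrow, so an author with a populated `initials` attribute is never dropped.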