Closed ajeanmahoney closed 1 year ago
Thanks @ajeanmahoney , we can certainly adjust this.
@strogonoff @stefanomunarini could you have a look to see if this affects relaton-py? Thanks.
@ronaldtse
Does this mean serialization is wrong, or source data is wrong?
The source should specify IAB as an org or a person consistently
We can special-case this and serialize IAB as a person rather than organization whenever it is encountered
There are definitely consistency issues in the source.
For some RFCs source gives organization: https://github.com/ietf-tools/relaton-data-rfcs/blob/0bdd462f7a9612cb842a08fae995bb738bd6f271/data/RFC1984.yaml#L21-L24
For others source gives a person (note how this is a bit awkward): https://github.com/ietf-tools/relaton-data-rfcs/blob/0bdd462f7a9612cb842a08fae995bb738bd6f271/data/RFC3716.yaml#L21-L46
The source data (rfc-index.xml) is correct in these cases. Note that the source data cannot consistently specify the IAB as an author or an organization because this hasn't been consistent over the history of IAB docs. The source data captures what is shown in the RFC header, and the bib data should match what is given in the RFC header.
As per: https://www.rfc-editor.org/rfc/rfc7991
The <organization> tag is only for:
2.35. <organization>
Specifies the affiliation [RFC7322] of an author.
This information appears both in the "Author's Address" section and
on the front page (see [RFC7322] for more information). If the value
is long, an abbreviated variant can be specified in the "abbrev"
attribute.
This element appears as a child element of <author> (Section 2.7).
Content model: only text content.
So it is only for affiliation.
In this case, the correct format in RFC XML seems to be (according to RFC 7991):
<author fullname="Internet Architecture Board"/>
Because there is no "organization".
We can do the same for all these you've listed:
Is this correct?
yes, thanks!
@ajeanmahoney Thanks! I do have one more question:
@andrew2net pointed out to me that in RFC 7991, this is also written: https://www.rfc-editor.org/rfc/rfc7991#section-2.7
2.7. <author>
...
Note that an "author" can also be just an organization (by not
specifying any of the "name" attributes, but adding the
<organization> child element).
So RFC 7991 does support representing an author as an organization after all, as long as the <author> element does not contain "name-related" attributes (i.e. asciiFullname, asciiInitials, asciiSurname, fullname, initials, surname).
I wonder if it is really impossible to do this:
<author>
<organization>Internet Architecture Board</organization>
</author>
Because this is the cleanest and most semantically accurate way of encoding according to RFC 7991.
PS: We have already had many heated (!!) discussions internally about this topic over the past 2 days, and we are hoping to address this issue in the cleanest way possible to facilitate future maintenance.
(cc @strogonoff @stefanomunarini @opoudjis)
The information coming from source (i.e., rfc-index.xml) should not be modified if it can be helped. The RPC can clean up source data, and this can propagate through rfc-index.xml. If it's needed later, we could work on a specification that describes what a transform of RFC bib data would look like.
To drive the point a little harder: the way the IAB has represented itself in the past in the RFC series has not been consistent. It is not right to try to rationalize what they did in the past into some single form of representing the reference - we need to report accurately on what happened at the time. The underlying semantics may or may not matter to the IAB that published a given document, but it was the semantic set at the time, and it's not ours to change.
Thanks @ajeanmahoney @rjsparks . Is the following summary correct?
<author><organization>Org Name</organization></author>
<organization>
Which means we have the following 3 categories of RFC authors.
Person:
<author fullname="..." initials="..." surname="..."/>
If with affiliation:
<author fullname="..." initials="..." surname="...">
<organization>Org Name</organization>
</author>
Organization:
<author>
<organization>Org Name</organization>
</author>
Undetermined:
<author fullname="Internet Architecture Board"/>
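As an illustration only, the three encodings listed above could be produced along these lines. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_author` and the `kind` labels ("person", "organization", "undetermined") are hypothetical and are not taken from relaton or the BibXML service.

```python
import xml.etree.ElementTree as ET


def build_author(kind, name, initials=None, surname=None, affiliation=None):
    """Return an <author> element for one of the three categories.

    `kind` is a hypothetical label: "person", "organization" or "undetermined".
    """
    if kind == "organization":
        # Organization-only author: no name attributes, one <organization> child.
        author = ET.Element("author")
        ET.SubElement(author, "organization").text = name
        return author
    if kind == "person":
        attrs = {"fullname": name, "initials": initials, "surname": surname}
        # Drop attributes that were not supplied so serialization cannot fail.
        author = ET.Element("author", {k: v for k, v in attrs.items() if v})
        if affiliation:
            ET.SubElement(author, "organization").text = affiliation
        return author
    if kind == "undetermined":
        # Only fullname is known; no initials, surname or <organization>.
        return ET.Element("author", {"fullname": name})
    raise ValueError(f"unknown author kind: {kind!r}")


for element in (
    build_author("person", "O. Kolkman", "O.", "Kolkman"),
    build_author("organization", "Org Name"),
    build_author("undetermined", "Internet Architecture Board"),
):
    print(ET.tostring(element, encoding="unicode"))
```

The sketch keeps the category decision outside the builder on purpose: deciding which category a name belongs to is the contested part of this thread, while the serialization itself is mechanical.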
To make this a bit more concrete. Look at the references in RFC9280, particularly to the reference to RFC5620. In text form:
[RFC5620] Kolkman, O., Ed. and IAB, "RFC Editor Model (Version 1)", RFC 5620, DOI 10.17487/RFC5620, August 2009, https://www.rfc-editor.org/info/rfc5620.
In the published XML:
<reference anchor="RFC5620" target="https://www.rfc-editor.org/info/rfc5620" quoteTitle="true" derivedAnchor="RFC5620">
<front>
<title>RFC Editor Model (Version 1)</title>
<author initials="O." surname="Kolkman" fullname="O. Kolkman" role="editor">
<organization showOnFrontPage="true"/>
</author>
<author>
<organization showOnFrontPage="true">IAB</organization>
</author>
<date year="2009" month="August"/>
<abstract>
<t indent="0">The RFC Editor performs a number of functions that may be carried out by various persons or entities. The RFC Editor model presented in this document divides the responsibilities for the RFC Series into four functions: The RFC Series Editor, the Independent Submission Editor, the RFC Production Center, and the RFC Publisher. It also introduces the RFC Series Advisory Group and an (optional) Independent Submission Stream Editorial Board. The model outlined here is intended to increase flexibility and operational support options, provide for the orderly succession of the RFC Editor, and ensure the continuity of the RFC series, while maintaining RFC quality and timely processing, ensuring document accessibility, reducing costs, and increasing cost transparency. This memo provides information for the Internet community.</t>
</abstract>
</front>
<seriesInfo name="RFC" value="5620"/>
<seriesInfo name="DOI" value="10.17487/RFC5620"/>
</reference>
And I don't think we've used the Undetermined form above in any published v3 xml for the IAB, and I think it's unlikely that it's been used for anything else that could be recognized as an organization.
If Undetermined did happen, for all practical purposes - it's identical to Person as you describe above.
@rjsparks Thanks. We don't have access to the RPC published XML but this is definitely enlightening.
In the excerpt of RFC 9280 of 5620's BibXML, it shows that IAB is encoded as an <author><organization>, which @ajeanmahoney stated in the original post is counter to their expectations.
<author>
<organization showOnFrontPage="true">IAB</organization>
</author>
The only reason we proposed the "Undetermined" category is that the RPC stated that they do not want to encode "IAB" as an organization for older RFCs (https://github.com/ietf-tools/bibxml-service/issues/296#issuecomment-1247084892), which is what has been done in RFC 9280.
P.S. The above excerpt also raises a question about the contents of the <organization> tag, as it is empty for the Kolkman reference, but that is irrelevant to our current discussion:
<author initials="O." surname="Kolkman" fullname="O. Kolkman" role="editor">
<organization showOnFrontPage="true"/>
</author>
We don't have access to the RPC published XML but this is definitely enlightening.
Yes you do. It's all in the rfc archive: rsync ftp.rfc-editor.org::rfcs/rfc\*xml
Sorry @rjsparks I missed the message. Now armed with this data, are we saying that we want the BibXML service to obtain RFC authorship from the RFC archive directly?
i.e. for rfc8650.xml onwards (first RFC in XML there), ignore the content of rfc-index.xml?
That would be quite a change.
No- you should use rfc-index.xml - I was pointing out that anyone has access to the RPC published XML.
Thanks @rjsparks . So we went through the full list of names available in rfc-index.xml (RFC0001 to RFC9318).
We found several issues that perhaps the RPC (ping @ajeanmahoney ) could address as data issues:
<name> A. Chiu</name> instead of <name>A. Chiu</name>
<name>H.. Lee</name>
<name>T. Connolly</name>
II being encoded as an independent name (RFC3789, RFC3790, RFC3791, RFC3792, RFC3793, RFC3794, RFC3795, RFC3796), all for <author><name>P. Nesser</name></author><author><name>II</name></author>
III encoded as an independent name (RFC9171, RFC9172, RFC9173), all for <author><name>E. Birrane</name></author><author><name>III</name></author>
<name>et al.</name> for RFC2555, which was really meant as "RFC Editor, et al.": <author><name>RFC Editor</name></author><author><name>et al.</name></author>
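Issues of this kind can be flagged mechanically and reported to the RPC without altering the data. The following is a minimal sketch, assuming the author names of one rfc-entry have already been extracted into a list of strings; `lint_author_names` and `SUFFIX_ONLY` are hypothetical names, and the suffix set only covers the two values seen in this thread.

```python
# Generational suffixes observed as stand-alone <name> values in this thread.
# Assumption: other suffixes (e.g. "Jr.") would be added here if they turn up.
SUFFIX_ONLY = {"II", "III"}


def lint_author_names(names):
    """Return (index, name, problem) tuples for suspicious <name> values.

    The input is left untouched: this reports problems, it does not fix them.
    """
    problems = []
    for index, name in enumerate(names):
        stripped = name.strip()
        if name != stripped:
            problems.append((index, name, "leading or trailing whitespace"))
        if ".." in name:
            problems.append((index, name, "repeated period"))
        if stripped in SUFFIX_ONLY:
            problems.append(
                (index, name, "generational suffix listed as a separate author")
            )
        if stripped.lower() == "et al.":
            problems.append((index, name, '"et al." listed as a separate author'))
    return problems


for problem in lint_author_names(["P. Nesser", "II"]):
    print(problem)
```

Reporting rather than repairing keeps the source data authoritative: any correction then happens in rfc-index.xml itself and propagates from there.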
Other than the above, we filtered all published RFCs and came up with this list of "non-personal names", and we will treat these as the "undetermined" type in BibXML (i.e. only contains fullname without initials or surname):
ACM SIGUCCS
Audio-Video Transport Working Group
Bolt Beranek
Defense Advanced Research Projects Agency
EARN Staff
ESCC X.500/X.400 Task Force
ESnet Site Coordinating Comittee (ESCC)
End-to-End Services Task Force
Energy Sciences Network (ESnet)
Federal Networking Council
Gateway Algorithms and Data Structures Task Force
IAB Advisory Committee
IAB and IESG
IAB
IANA
IESG
IETF Secretariat
ISO
ISOC Board of Trustees
Information Sciences Institute University of Southern California
International Organization for Standardization
International Telegraph and Telephone Consultative Committee of the International Telecommunication Union
Internet Activities Board
Internet Architecture Board
Internet Assigned Numbers Authority (IANA)
Internet Engineering Steering Group
KOI8-U Working Group
Mitra
National Bureau of Standards
National Research Council
National Science Foundation
NetBIOS Working Group in the Defense Advanced Research Projects Agency
Network Information Center. Stanford Research Institute
Network Technical Advisory Group
Newman Laboratories
North American Directory Forum
Sun Microsystems
The Internet Society
The North American Directory Forum
Vietnamese Standardization Working Group
P.S. We have also found several issues in the names of references in published RFC XMLs, but not sure if those are to be reported.
@rjsparks I am confused by this particular comment:
To make this a bit more concrete. Look at the references in RFC9280, particularly to the reference to RFC5620. In text form:
[RFC5620] Kolkman, O., Ed. and IAB, "RFC Editor Model (Version 1)", RFC 5620, DOI 10.17487/RFC5620, August 2009, https://www.rfc-editor.org/info/rfc5620.
In the published XML:
<reference anchor="RFC5620" target="https://www.rfc-editor.org/info/rfc5620" quoteTitle="true" derivedAnchor="RFC5620"> <front> <title>RFC Editor Model (Version 1)</title> <author initials="O." surname="Kolkman" fullname="O. Kolkman" role="editor"> <organization showOnFrontPage="true"/> </author> <author> <organization showOnFrontPage="true">IAB</organization> </author> ... </reference>
This example clearly points to "IAB" being encoded as an organization author.
However, the RPC (@ajeanmahoney above) is specifically asking to NOT represent "IAB" as an organization author (https://github.com/ietf-tools/bibxml-service/issues/296#issuecomment-1248097196).
(Also a note to self that this is RFC 5620 bibliographic data in RFC 9280, and only RFCs after RFC 8650 are available in XML.)
Without further clarification, I understand that the RPC desired behavior is correct and that our proposal satisfies the needs, namely:
<author fullname="Internet Architecture Board"/>
<author fullname="..." initials="..." surname="..."/>
If with affiliation:
<author fullname="..." initials="..." surname="...">
<organization>Org Name</organization>
</author>
If the intent of this ticket was only to special-case IAB and IESG, please help us categorize the full list of non-person authors by whether they are type 1 or type 3 authors.
Thanks.
The comment you point to from Jean is to not change what the rfc-index says, even if it treats the IAB differently for different RFCs.
The attempts to normalize the data (the creation of a list to control logic to feed things into "undetermined", which would need to be maintained), and the classifications themselves, may be too much.
I'll go through the proposal again carefully tomorrow and say more, but what we should be trying to do is simplify, not add complexity.
The crux of the problem is that the RFC XML name model and that of rfc-index.xml are different, which leads to different expectations from BibXML consumers.
For the "author" element, RFC XML expects fullname, initials and surname.
rfc-index.xml only provides the equivalent of fullname.
The RPC expects at least the following:
1. Personal names to fully populate fullname, initials and surname;
2. The "IAB" and "IESG" related authors to have no initials and surname;
3. Some organizations to be encoded as organizational authors. ("Some" because we do not have guidance on which ones, which is the point of the proposal.)
Other BibXML consumers expect at least 1 and 3, which is evident from the other tickets raised.
It's not about "not changing": the name model differences require us to make a determination on the nature of the name. And this needs to be done until the RPC adopts RFC XML for its bibliographic information.
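To illustrate the mismatch: rfc-index.xml supplies a single string per author, so any converter has to derive initials and surname from it. The sketch below uses a hypothetical `naive_split` helper that splits on the first space. It is not the service's actual code, but it yields the same initials and surname values as the bib.ietf.org output reported in this issue for "IAB Advisory Committee" and "IAB and IESG".

```python
def naive_split(fullname):
    """Split on the first space: first token is 'initials', the rest is 'surname'."""
    initials, _, surname = fullname.partition(" ")
    return {"fullname": fullname, "initials": initials, "surname": surname}


# Works for a typical personal name as given in rfc-index.xml.
print(naive_split("O. Kolkman"))

# Produces a meaningless initials/surname pair for a group name.
print(naive_split("IAB Advisory Committee"))
```

Because the split itself cannot tell a person from a group, some decision about the nature of each name has to be made before the attributes are populated.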
So first - focusing on the original request: The essence was to not turn the string "Internet Architecture Board" into "IAB" or vice-versa, but to use whichever variant the rfc-index provides for a given RFC (and it will not be consistent). That should be preserved whether the string gets identified as an organization or not. Please confirm that this part of the request was not lost.
The conversation above quickly focused on the "organization or not" question. The proposal you are working towards is probably the best we can hope for, but there are going to be problematic cases, such as RFC4732 where the text lists the IAB as (semantics inferred) name=Internet Architecture Board, organization=IAB. The old xml2rfc.tools.ietf.org bibxml service just represented this as <author><organization>IAB</organization></author> and that's where we should also aim at the current time. The RPC will go through an effort to groom their database and what they provide in places like rfc-index.xml in the future.
So, reluctantly, I agree that having a list like you have extracted above is necessary, but matches against the list should put the author into the "Organization", not "Undetermined" category. Undetermined should continue to exist as the catch-all.
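The rule described here (a match against the list means Organization, with Undetermined as the catch-all) could look roughly like the following. This is a sketch under stated assumptions: `KNOWN_ORGANIZATIONS` is a short hypothetical excerpt of the extracted list, and `PERSON_PATTERN` is an assumed shape for personal names (one or more initials followed by a surname), not a rule taken from rfc-index.xml documentation.

```python
import re

# Hypothetical excerpt; the full list is the one extracted from rfc-index.xml.
KNOWN_ORGANIZATIONS = {
    "IAB",
    "IESG",
    "IAB and IESG",
    "Internet Architecture Board",
    "Sun Microsystems",
}

# Assumed shape of a personal name: initials such as "O." or "J.K.", then a surname.
PERSON_PATTERN = re.compile(r"^(?:[A-Z]\.\s?)+\S.*$")


def classify(name):
    """Return "organization", "person" or "undetermined" for an author string."""
    if name in KNOWN_ORGANIZATIONS:
        # Exact match only: the string from rfc-index.xml is never rewritten.
        return "organization"
    if PERSON_PATTERN.match(name):
        return "person"
    # Catch-all for anything that is neither listed nor shaped like a person.
    return "undetermined"


for name in ("IAB", "O. Kolkman", "Some Unlisted Group"):
    print(name, "->", classify(name))
```

Matching on the exact string keeps "IAB" and "Internet Architecture Board" as separate list entries, so whichever variant rfc-index.xml provides for a given RFC is passed through unchanged.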
Nit: Where did you find 'Bolt Baranek', and are you sure it didn't say 'Bolt Baranek and Newman' there?
Currently rfc-index.xml does not provide information on author's organizations or identify organizations as organizations.
The RPC needs to plan to provide this sort of information in the future to the bibxml service, and we will coordinate this data roll out.
P.S. We have also found several issues in the names of references in published RFC XMLs, but not sure if those are to be reported.
These should probably be reported as errata, but send an example to the RPC by email to verify that's how they want to ingest the information.
The essence was to not turn the string "Internet Architecture Board" into "IAB" or vice-versa, but to use whichever variant the rfc-index provides for a given RFC (and it will not be consistent). That should be preserved whether the string gets identified as an organization or not. Please confirm that this part of the request was not lost.
This part of the request was handled in this issue:
Nit: Where did you find 'Bolt Baranek', and are you sure it didn't say 'Bolt Baranek and Newman' there?
From rfc-index.xml:
<rfc-entry>
<doc-id>RFC0907</doc-id>
<title>Host Access Protocol specification</title>
<author>
<name>Bolt Beranek</name>
</author>
<author>
<name>Newman Laboratories</name>
</author>
So, reluctantly, I agree that having a list like you have extracted above is necessary, but matches against the list should put the author into the "Organization", not "Undetermined" category. Undetermined should continue to exist as the catch-all.
We are happy to treat all of these as "organizations", but we wish for the RPC to confirm with us which ones are organizations and which ones we should not semantically determine, because as @ajeanmahoney mentioned in this ticket, "IAB" and "IESG" are not to be treated as organizations (they are in the list).
From the list of non-personal authors (which covers all RFCs to date), there are a few categories:
Real organizations:
Bolt Beranek
Defense Advanced Research Projects Agency
Federal Networking Council
ISO
Information Sciences Institute University of Southern California
International Organization for Standardization
International Telegraph and Telephone Consultative Committee of the International Telecommunication Union
Mitra
National Bureau of Standards
National Research Council
National Science Foundation
Network Information Center. Stanford Research Institute
Newman Laboratories
North American Directory Forum
Sun Microsystems
The North American Directory Forum
Groups:
ACM SIGUCCS
Audio-Video Transport Working Group
ESCC X.500/X.400 Task Force
ESnet Site Coordinating Comittee (ESCC)
End-to-End Services Task Force
Energy Sciences Network (ESnet)
Gateway Algorithms and Data Structures Task Force
KOI8-U Working Group
NetBIOS Working Group in the Defense Advanced Research Projects Agency
Network Technical Advisory Group
Vietnamese Standardization Working Group
"Organization-like":
EARN Staff
IETF related (note that "IAB", "IESG", "IAB Advisory Committee" are already stated to be not organizations):
IAB Advisory Committee
IAB and IESG
IAB
IANA
IESG
IETF Secretariat
ISOC Board of Trustees
Internet Activities Board
Internet Architecture Board
Internet Assigned Numbers Authority (IANA)
Internet Engineering Steering Group
The Internet Society
@ajeanmahoney would you mind helping us decide which of the above are to be tagged under <organization> vs not? Thank you in advance!
So, the list should have "Bolt Beranek" and "Newman Laboratories" and "Bolt Beranek and Newman Laboratories", so that the right thing happens when the RPC updates the author information for RFC907.
Edit: maybe just the last of the three now - the rfc-index has already been updated.
A quick reskim of https://github.com/ietf-tools/xml2rfc-bibxml/blob/main/bibxml/bibxml-rfcs/gen-bibxml-rfcs-via-rfc-index would probably be helpful here.
@ronaldtse All entries in the lists above ("Real organizations", "Groups", "Organization-like", and "IETF related") should be treated as "organizations" so that we may move forward.
A quick reskim of https://github.com/ietf-tools/xml2rfc-bibxml/blob/main/bibxml/bibxml-rfcs/gen-bibxml-rfcs-via-rfc-index would probably be helpful here.
Thank you @rjsparks , not sure how we missed this all this time!
We will incorporate these as test cases at:
@ronaldtse All entries in the lists above ("Real organizations", "Groups", "Organization-like", and "IETF related") should be treated as "organizations" so that we may move forward.
Thank you @ajeanmahoney !
Apologies for adding to the list of authors that are actually organizations that need special handling, but RFC 2555 should have "RFC Editor, et al." treated as an organization.
From rfc-index.xml:
<rfc-entry>
<doc-id>RFC2555</doc-id>
<title>30 Years of RFCs</title>
<author>
<name>RFC Editor, et al.</name>
</author>
Currently in bibxml:
<reference anchor="RFC2555" target="https://www.rfc-editor.org/info/rfc2555">
<front>
<title>30 Years of RFCs</title>
<author fullname="RFC Editor, et al." initials="RFC" surname="Editor, et al."/>
Perhaps (similar to the handling for RFC 5000):
<reference anchor="RFC2555" target="https://www.rfc-editor.org/info/rfc2555">
<front>
<title>30 Years of RFCs</title>
<author>
<organization>RFC Editor, et al.</organization>
</author>
@ajeanmahoney thank you for suggesting to handle "RFC Editor, et al." as an organization, that makes more sense. We'll add this to the test cases! (added to https://github.com/relaton/relaton-ietf/issues/102)
@ajeanmahoney I believe the issue has been solved with the latest PR #306:
https://bib.ietf.org/get-bibliographic-item/?query=RFC+4089
https://bib.ietf.org/get-bibliographic-item/?query=RFC+3716
If so, could you help verify and close this issue? Thanks!
It has fixed the issue of displaying author name of "Internet Architecture Board" when "IAB" should be displayed. However, there are some RFCs (e.g., RFC 4845) where the author name is given as "Internet Architecture Board" in the document header. References to these RFCs should display "Internet Architecture Board".
@ajeanmahoney I'm a bit confused:
https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4845.xml
does provide:
<author>
<organization abbrev="IAB">Internet Architecture Board</organization>
</author>
Is this incorrect?
@ronaldtse yes, the XML is as shown above. However, if the abbrev attribute is provided and has content, it is used in the reference instead.
In the case where "Internet Architecture Board" is given as the author name in rfc-index.xml, the abbrev attribute should not be used so that "Internet Architecture Board" is displayed.
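The effect of the abbrev attribute can be sketched as follows. The `display_name` function is a hypothetical stand-in for the rendering rule described above (a non-empty abbrev is used instead of the element text); it is not the actual code of any renderer. `organization_author` is likewise a hypothetical helper that emits the rfc-index.xml string with no abbrev.

```python
import xml.etree.ElementTree as ET


def organization_author(index_name):
    """Build <author><organization> from the name exactly as rfc-index.xml gives it."""
    author = ET.Element("author")
    organization = ET.SubElement(author, "organization")
    # No abbrev attribute is set, so the full string is what gets displayed.
    organization.text = index_name
    return author


def display_name(author):
    """Mimic the described rule: a non-empty abbrev wins over the text content."""
    organization = author.find("organization")
    return organization.get("abbrev") or organization.text


with_abbrev = ET.fromstring(
    '<author><organization abbrev="IAB">Internet Architecture Board</organization></author>'
)
print(display_name(with_abbrev))
print(display_name(organization_author("Internet Architecture Board")))
```

Under this rule, the only way for a reference to show "Internet Architecture Board" is to omit abbrev (or leave it empty) whenever rfc-index.xml gives the long form.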
Based on a conversation with the RPC today, we're going to close this issue.
The major takeaway from having the conversation is that we want to avoid trying to clean/normalize source data as part of this service - we should instead push back on the source data itself.
Describe the issue
(This report should not be confused with #262, which is an issue with a different library)
For 35 RFCs in the IAB Stream, the Internet Architecture Board is listed as an author (not an organization). (Note that the IAB no longer lists itself as an author in its RFCs, so this issue only impacts a subset of older IAB Stream documents.)
For 22 of those RFCs, the author information is presented as the following in rfc-index.xml:
For 11 of those RFCs, the author information in rfc-index.xml is the following:
In the bib information returned from bib.ietf.org, this has been transformed into
Author information in a reference should match author information provided in the header of the RFC. For these IAB Stream documents, if rfc-index.xml specifies "IAB" as an author, then this should also be used in the bib data. If rfc-index.xml uses "Internet Architecture Board" for an author name, then this should be used in the bib data. This will let the bib data more closely match the information in the RFC header.
There are also a small number of documents that list the IESG (5 documents) or the Internet Engineering Steering Group (2 documents) as author. This information also needs to be passed through unmodified.
Note: The following are unique to the RFC series, but using the rfc-index.xml author information should work:
RFC 3716 rfc-index.xml:
<author><name>IAB Advisory Committee</name></author>
bib.ietf.org: <author fullname="IAB Advisory Committee" initials="IAB" surname="Advisory Committee"/>
RFC 4089 rfc-index.xml:
<author><name>IAB and IESG</name></author>
bib.ietf.org: <author fullname="IAB and IESG" initials="IAB" surname="and IESG"/>