ietf-tools / relaton-data-ids

Bibliographic data information for Internet-Drafts in Relaton format

Setup GHA automated pulls from official IETF Internet-Drafts dataset #1

Closed ronaldtse closed 2 years ago

ronaldtse commented 2 years ago

Available now at https://www.ietf.org/lib/dt/sprint/bibxml3.tgz

This is a ~46MB file that expands to ~270MB containing all the Internet-Draft bibliographic entries.

We need to make these available via the BibXML indexer / service for searching.

(Thank you @rjsparks )

andrew2net commented 2 years ago

@ronaldtse the files in the archive are already in the BibXML format. Do we need to convert these documents to YAML and RelatonXML? If not we can just save them in this repo.

ronaldtse commented 2 years ago

@andrew2net yes let's use the Relaton YAML format.

andrew2net commented 2 years ago

@ronaldtse I ran into a file (reference.I-D.draft-ietf-quic-manageability-04.xml) with an error in its date:

<?xml version="1.0" encoding="UTF-8"?>
<reference anchor="I-D.ietf-quic-manageability">
   <front>
      <title>Manageability of the QUIC Transport Protocol</title>
      <author initials="M." surname="Kühlewind" fullname="Mirja Kühlewind">
         <organization>ETH Zurich</organization>
      </author>
      <author initials="B." surname="Trammell" fullname="Brian Trammell">
         <organization>ETH Zurich</organization>
      </author>
      <date month="** No value found for &#39;doc.date&#39; **" day="** No value found for &#39;doc.date.day&#39; **" year="** No value found for &#39;doc.date.year&#39; **" />
      <abstract>
     <t>   This document discusses manageability of the QUIC transport protocol,
   focusing on caveats impacting network operations involving QUIC
   traffic.  Its intended audience is network operators, as well as
   content providers that rely on the use of QUIC-aware middleboxes,
   e.g. for load balancing.

     </t>
      </abstract>
   </front>
   <seriesInfo name="Internet-Draft" value="draft-ietf-quic-manageability-04" />
   <format type="TXT" target="https://www.ietf.org/archive/id/draft-ietf-quic-manageability-04.txt" />
</reference>

I can drop the data but shouldn't we message the publisher?
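A check like the following could flag such records before conversion. This is only a sketch: the function name `date_problems` is hypothetical, and it assumes that a usable date needs a four-digit year while month and day may be absent or empty.

```python
import re
import xml.etree.ElementTree as ET

# English month names, as BibXML files commonly use them (assumption).
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def _is_number_in_range(text, low, high):
    """True if text is a 1-2 digit ASCII number within [low, high]."""
    return bool(re.fullmatch(r"[0-9]{1,2}", text)) and low <= int(text) <= high


def date_problems(reference_xml):
    """Return a list of problems found in the <front>/<date> element."""
    root = ET.fromstring(reference_xml)
    date = root.find("./front/date")
    if date is None:
        return ["missing <date> element"]

    problems = []
    year = date.get("year", "")
    if not re.fullmatch(r"[0-9]{4}", year):
        problems.append(f"bad year: {year!r}")

    month = date.get("month", "")
    if month and month not in MONTHS and not _is_number_in_range(month, 1, 12):
        problems.append(f"bad month: {month!r}")

    day = date.get("day", "")
    if day and not _is_number_in_range(day, 1, 31):
        problems.append(f"bad day: {day!r}")

    return problems
```

Under these assumptions, the record above would yield three problems (year, month, and day), and a record with all three attributes empty would yield one (the year).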

rjsparks commented 2 years ago

Bah - my fault on generating the dump (using a dev instance instead of the production instance)

The next version I generate for you will not have that, but, it will look like <date month="", day="", year="">.

I'll look at the record for it to see why the date is empty.

rjsparks commented 2 years ago

While reworking the source you will resync from, I need to make a decision.

Note that all four of these URLs work (and retrieve the same content). The difference is the presence or absence of the 'draft-' prefix, and the version number. The community is accustomed to all four of these variants working, so I will generate all of them for the current version of drafts in the dataset I make for you to sync from, and will plan to provide them all for any new versions (or perhaps we could get the api to recognize that it should make four things given the draft name and version so the datatracker only has to send the content bits once).

https://xml2rfc.tools.ietf.org/public/rfc/bibxml-ids/reference.I-D.sparks-sipcore-multiple-reasons.xml
https://xml2rfc.tools.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-sparks-sipcore-multiple-reasons.xml
https://xml2rfc.tools.ietf.org/public/rfc/bibxml-ids/reference.I-D.sparks-sipcore-multiple-reasons-00.xml
https://xml2rfc.tools.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-sparks-sipcore-multiple-reasons-00.xml

andrew2net commented 2 years ago

@rjsparks thank you for the clarification. The Relaton model has no filename attribute, so we generate the filename from the document identifier. In this case the identifier is the value of the seriesInfo[@name='Internet-Draft'] element.

<seriesInfo name="Internet-Draft" value="draft-sparks-sipcore-multiple-reasons-00"/>

Therefore we will have only one file in our dataset for the four URLs you mentioned above. Is that OK?
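As a rough illustration of that derivation (not the actual Relaton code): the sketch below reads the identifier from the seriesInfo element and builds a filename from it. The function names and the `.yaml` extension are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def draft_identifier(reference_xml):
    """Return the value of seriesInfo[@name='Internet-Draft'], or None."""
    root = ET.fromstring(reference_xml)
    for info in root.iter("seriesInfo"):
        if info.get("name") == "Internet-Draft":
            return info.get("value")
    return None


def dataset_filename(reference_xml, extension="yaml"):
    """Build a dataset filename from the document identifier (hypothetical layout)."""
    identifier = draft_identifier(reference_xml)
    if not identifier:
        raise ValueError("no Internet-Draft seriesInfo found")
    return f"{identifier}.{extension}"
```

Because the identifier always carries the `draft-` prefix and the version, all four URL variants collapse to the same file.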

rjsparks commented 2 years ago

yes - specifically, that means you will recognize all four variants at the replacement for the webservice you are building, correct?

I think this means I should restrict what goes into the source I'm building for you to just the single file reference.I-D.draft-sparks-sipcore-multiple-reasons-00.xml, and not include the other variants. Does that sound right to you?
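One way the replacement service could recognize all four variants while storing only the full name is sketched below. It is not the service's actual implementation: `resolve_variant` and `known_ids` are hypothetical names, and it assumes revisions are two-digit suffixes so that the highest one can be chosen for unversioned requests.

```python
import re

# Matches legacy filenames such as reference.I-D.<name>.xml
_LEGACY_NAME = re.compile(r"^reference\.I-D\.(?P<name>.+)\.xml$")


def resolve_variant(filename, known_ids):
    """Map any of the four legacy filename variants to a stored identifier.

    known_ids is the set of full identifiers in the dataset, for example
    {"draft-sparks-sipcore-multiple-reasons-00"}. Returns None if no match.
    """
    match = _LEGACY_NAME.match(filename)
    if match is None:
        return None

    name = match.group("name")
    if not name.startswith("draft-"):
        name = "draft-" + name  # restore the optional prefix

    if name in known_ids:  # the request already included the version
        return name

    # Unversioned request: pick the highest two-digit revision on record.
    prefix = name + "-"
    candidates = [
        ident for ident in known_ids
        if ident.startswith(prefix)
        and re.fullmatch(r"[0-9]{2}", ident[len(prefix):])
    ]
    return max(candidates, default=None)
```

With this approach the dataset needs only the fully qualified file; the "latest" and "no prefix" variants are computed at request time.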

andrew2net commented 2 years ago

It sounds good. Thanks.

rjsparks commented 2 years ago

So, I have replaced https://www.ietf.org/lib/dt/sprint/bibxml3.tgz with https://www.ietf.org/lib/dt/sprint/bibxml-ids.tgz

The contents will expand into bibxml-ids rather than bibxml3. This anticipates the rsync endpoint that will replace this tarfile, which I expect to become available in early January.

The file is up to date as of ~2 days ago, and only contains the reference variant with the full draft name and version - no "latest" variants or "missing the draft- prefix" variants are included.
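A consumer could sanity-check that property with something like the sketch below. The pattern is a heuristic assumed from the filenames quoted in this thread (`reference.I-D.draft-<name>-NN.xml`), and `unexpected_members` is a hypothetical helper name.

```python
import re
import tarfile

# Heuristic for the full variant: "draft-" prefix plus a two-digit version.
FULL_VARIANT = re.compile(r"^reference\.I-D\.draft-.+-[0-9]{2}\.xml$")


def unexpected_members(tar):
    """Return names of regular files in an open TarFile that are not full variants."""
    unexpected = []
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories and links
        basename = member.name.rsplit("/", 1)[-1]
        if not FULL_VARIANT.match(basename):
            unexpected.append(member.name)
    return unexpected
```

Typical use would be `with tarfile.open("bibxml-ids.tgz", "r:gz") as tar: print(unexpected_members(tar))`, where an empty list means only full-name, versioned files were found. The heuristic cannot tell a versioned name apart from an unversioned draft whose name happens to end in two digits.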

@ronaldtse This is a little different from where we ended on the earlier thread about bibxml-id - please review to make sure it's on an agreeable path.

ronaldtse commented 2 years ago

@rjsparks this sounds reasonable. @andrew2net can you help update? Thanks.

andrew2net commented 2 years ago

The new bibxml-ids.tgz is good. No more file duplication.

ronaldtse commented 2 years ago

@andrew2net is this ready to go? Thanks.

andrew2net commented 2 years ago

It's implemented, but I'll publish the gem once I finish https://github.com/ietf-ribose/relaton-data-rfcs/issues/1 this weekend.

ronaldtse commented 2 years ago

Got it, thanks!

andrew2net commented 2 years ago

@ronaldtse I've pushed the parsed data. Can you confirm that the files have the correct format?

ronaldtse commented 2 years ago

Thanks @andrew2net .

@kwkwan @strogonoff Right now the bibxml-service is configured to use the bibxml-data-ids repository. Can you change the demo instance to load from the new repositories? Thanks.