Closed ronaldtse closed 3 years ago
For every registry, there is an XML file describing its contents, which is where we will get the dates and details of this registry.
e.g. for sip-parameters:
$ wget https://www.iana.org/assignments/sip-parameters/sip-parameters.xml
The only metadata to obtain is this:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="sip-parameters.xsl"?>
<?xml-model href="sip-parameters.rng" schematypens="http://relaxng.org/ns/structure/1.0" ?>
<registry xmlns="http://www.iana.org/assignments" id="sip-parameters">
<title>Session Initiation Protocol (SIP) Parameters</title>
<created>2002-01</created>
<updated>2021-01-20</updated>
The rest is "sub-registry" information. There are 80 "sub-registries" in this particular registry (the count below is 81 because grep also matches the top-level <registry> element):
$ grep '<registry' sip-parameters.xml | wc -l
81
e.g.
<registry id="sip-parameters-2">
<title>Header Fields</title>
<expert>Adam Roach</expert>
<xref type="rfc" data="rfc3261"/>
<xref type="rfc" data="rfc3427"/>
<xref type="rfc" data="rfc5727"/>
<note>The table below lists the header fields currently defined for the
Session Initiation Protocol (SIP) <xref type="rfc" data="rfc3261"/>. Some headers have
single-letter compact forms (Section 7.3 of RFC 3261). Header field
names are case-insensitive.
Standard header fields and messages MUST NOT begin with the leading
characters "P-". Existing "P-" header field registrations are
considered grandfathered, but new registrations of Informational
header fields should not begin with the leading characters "P-"
(unless the "P-" would preserve compatibility with an pre-existing
unregistered usage of the header field, at the discretion of the
Designated Expert). Short forms of header fields MUST only be
assigned to standards track header fields. At the discretion of the
Designated Expert, a header registration may require a Standards
Action.
</note>
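As a rough illustration of the extraction described above, here is a minimal Python sketch that pulls the top-level title and dates out of a registry XML file and lists its sub-registries. The function name `registry_metadata` and the `SAMPLE` document are hypothetical, made up for this example; only the namespace and element names come from the excerpt above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <registry xmlns="..."> root element shown above.
IANA_NS = "http://www.iana.org/assignments"
NS = {"iana": IANA_NS}


def registry_metadata(xml_text):
    """Return top-level metadata plus (id, title) pairs for nested registries."""
    root = ET.fromstring(xml_text)
    # iter() yields the root element as well, so it is skipped explicitly.
    # This mirrors the grep count minus one.
    subs = [el for el in root.iter(f"{{{IANA_NS}}}registry") if el is not root]
    return {
        "id": root.get("id"),
        "title": root.findtext("iana:title", namespaces=NS),
        "created": root.findtext("iana:created", namespaces=NS),
        "updated": root.findtext("iana:updated", namespaces=NS),
        "subregistries": [
            (el.get("id"), el.findtext("iana:title", namespaces=NS)) for el in subs
        ],
    }


# Hypothetical, heavily trimmed stand-in for sip-parameters.xml.
SAMPLE = """
<registry xmlns="http://www.iana.org/assignments" id="sip-parameters">
  <title>Session Initiation Protocol (SIP) Parameters</title>
  <created>2002-01</created>
  <updated>2021-01-20</updated>
  <registry id="sip-parameters-2"><title>Header Fields</title></registry>
  <registry id="sip-parameters-4"><title>Reason Protocols</title></registry>
</registry>
"""

meta = registry_metadata(SAMPLE)
print(meta["title"], meta["updated"], len(meta["subregistries"]))
```

Only the direct children of the root are read for the title and dates, so the titles of nested registries do not leak into the top-level metadata.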
@rjsparks I'd like to check if it's okay to make the IANA BibXML dataset a static one -- with only 581 pages, crawling them once a day seems like the best approach. Thanks!
I'm not particularly comfortable with that without asking IANA. The RFP as specified would only fetch a registry from IANA when someone wanted bibtex related to it - it's quite likely that some IANA registries could go days or longer without being accessed that way. We would only be caching the results that were actually requested. This plan would force us to add a daily refresh load, which I admit isn't large, but I would want to ask them (which I will do and report back).
If we were to proceed this way, given that the cache period from the RFP is configurable, the refresh scan rate would need to be configurable too, and it would be best if there were a way to signal that a particular registry should be refreshed, to avoid hitting all the IANA pages again just to pick up the one changed registry that is wanted.
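To make the two requirements concrete (a configurable cache period and a way to refresh one registry on demand), here is a minimal sketch. The class `RegistryCache`, its `fetch` callback, and the injectable `clock` are all hypothetical names for this illustration, not part of the RFP or of any existing implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class RegistryCache:
    # Hypothetical callback that downloads one registry's XML by its id.
    fetch: Callable[[str], str]
    # Configurable cache period; 24 hours is only a placeholder default.
    ttl_seconds: float = 24 * 3600
    # Injectable clock so the expiry logic can be exercised without waiting.
    clock: Callable[[], float] = time.monotonic
    _entries: Dict[str, Tuple[float, str]] = field(default_factory=dict)

    def get(self, registry_id: str) -> str:
        """Return the cached copy, refetching only if it is missing or expired."""
        entry = self._entries.get(registry_id)
        if entry is None or self.clock() - entry[0] >= self.ttl_seconds:
            return self.refresh(registry_id)
        return entry[1]

    def refresh(self, registry_id: str) -> str:
        """Force a refetch of a single registry without touching the others."""
        body = self.fetch(registry_id)
        self._entries[registry_id] = (self.clock(), body)
        return body
```

With this shape, a signal that one registry has changed maps to a single `refresh("sip-parameters")` call, and no other IANA page is requested.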
btw - when registries do change, authors tend to be very impatient to reference the results.
Thanks for consulting IANA on this!
Re: authors being impatient. The current IANA BibXML only provides the name of the registry and nothing more, so the impatience should only apply when there's a new registry that an author can't wait to cite.
@rjsparks I just found out that Lars Eggert is already rsyncing all the IANA registries to GitHub: https://github.com/larseggert/iana-assignments
So it's easy to rsync in this case, too.
@rjsparks we've implemented a mirror using the IANA rsync endpoint, with the full data available here: https://github.com/ietf-ribose/iana-registries . It's fast, and presumably won't be an extra burden to them (since others are already publicly doing so...).
I will file another issue for consideration of sync cadence:
There is a limited number of IANA registries -- they rarely change.
According to the mechanism used by today's IANA fetching script: https://trac.ietf.org/trac/xml2rfc/browser/website/public/rfc/bibxml-iana/nph-index.pl
Every registry corresponds to a single BibXML entry: https://xml2rfc.tools.ietf.org/public/rfc/bibxml-iana/reference.IANA.sip-parameters.xml
=>
Given that there are only 581 registries, performing 581 requests per day isn't so bad. Instead of needing a 24-hour expiring cache, it would be more effective to scan the 581 pages daily and generate a static BibXML dataset.
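A rough sketch of what generating the static dataset could look like, with one file per registry. The `reference.IANA.<id>.xml` file naming follows the URL above, but the exact shape of the `<reference>` element (anchor, target, author organization, empty date) is an assumption for illustration and should be checked against the existing BibXML output. `bibxml_entry` and `write_dataset` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def bibxml_entry(registry_id, title):
    """Build one BibXML <reference> element as a string (assumed layout)."""
    ref = ET.Element(
        "reference",
        {
            "anchor": f"IANA.{registry_id}",
            # Assumed target URL pattern, based on the registry URL used above.
            "target": f"https://www.iana.org/assignments/{registry_id}",
        },
    )
    front = ET.SubElement(ref, "front")
    ET.SubElement(front, "title").text = title
    author = ET.SubElement(front, "author")
    ET.SubElement(author, "organization").text = "Internet Assigned Numbers Authority"
    ET.SubElement(front, "date")  # left empty in this sketch
    return ET.tostring(ref, encoding="unicode")


def write_dataset(registries, out_dir):
    """Write one reference.IANA.<id>.xml file per (id -> title) mapping entry."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for registry_id, title in sorted(registries.items()):
        path = out / f"reference.IANA.{registry_id}.xml"
        path.write_text(bibxml_entry(registry_id, title), encoding="utf-8")
        paths.append(path)
    return paths
```

A daily job would then call `write_dataset` with the titles read from the mirrored registry XML files and serve the resulting directory as static content.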
These are the 581 registries today: