Implement new version/dynamic fetching for Internet-Drafts

ronaldtse commented 2 years ago

The Internet-Drafts dataset is a special dataset: it is served from both the bulk-loaded data and the latest data fetched directly from datatracker.ietf.org.

datatracker.ietf.org is the authoritative source of the data. The bulk data that the BibXML service loads comes from a periodic export from Datatracker.

There are two types of references that are served by the service:

Unversioned Reference pattern: draft-{example-name}. This is in fact a redirect to the draft with the highest {draft-number}.
Versioned Reference pattern: draft-{example-name}-{draft-number}. The {draft-number} is a two-digit sequential integer that starts at 00 or 01 and increments by one with each revision.
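
To make the two patterns concrete, here is a minimal sketch of how a reference could be classified. The names `DraftRef` and `parse_draft_ref` are hypothetical and not taken from the BibXML service code; the sketch assumes that a trailing two-digit segment is always a draft number.

```python
from typing import NamedTuple, Optional


class DraftRef(NamedTuple):
    """Hypothetical container for a parsed Internet-Draft reference."""
    name: str               # the {example-name} part
    version: Optional[str]  # the two-digit {draft-number}, or None if unversioned


def parse_draft_ref(ref: str) -> DraftRef:
    """Split a reference into name and optional two-digit draft number."""
    prefix = "draft-"
    if not ref.startswith(prefix) or len(ref) == len(prefix):
        raise ValueError(f"not an Internet-Draft reference: {ref!r}")
    rest = ref[len(prefix):]
    head, sep, tail = rest.rpartition("-")
    # Assumption: a trailing segment of exactly two ASCII digits is the draft number.
    if sep and head and len(tail) == 2 and tail.isascii() and tail.isdigit():
        return DraftRef(name=head, version=tail)
    return DraftRef(name=rest, version=None)
```

With this, `parse_draft_ref("draft-ietf-foo-bar-03")` yields a versioned reference and `parse_draft_ref("draft-ietf-foo-bar")` an unversioned one.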

For the Versioned Reference pattern (given draft-{example-name}-{draft-number}), the operating mode is:

If the bulk Internet-Drafts dataset contains draft-{example-name}-{draft-number}, return it and we are done.
If the bulk Internet-Drafts dataset does not contain draft-{example-name}-{draft-number}, then either datatracker.ietf.org has a draft number newer than the last bulk export, or the draft does not exist. The BibXML Service should then contact datatracker.ietf.org to load draft-{example-name}-{draft-number}.
- If datatracker returns with data, then we cache the entry by entering it into our database, and return this.
- If datatracker does not have this draft, it means that the draft does not exist. We return an error.
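
The versioned flow above could be sketched roughly as follows. This is only an illustration: the in-memory `index` dict stands in for the database, and `fetch_from_datatracker` is a hypothetical callable that returns `None` when Datatracker does not know the draft.

```python
from typing import Callable, Dict, Optional


class DraftNotFound(Exception):
    """Raised when neither the bulk dataset nor Datatracker has the draft."""


def get_versioned_draft(
    ref: str,
    index: Dict[str, dict],
    fetch_from_datatracker: Callable[[str], Optional[dict]],
) -> dict:
    # Step 1: the bulk dataset (or an earlier cached fetch) has the draft.
    found = index.get(ref)
    if found is not None:
        return found
    # Step 2: fall back to Datatracker, the authoritative source.
    fetched = fetch_from_datatracker(ref)
    if fetched is None:
        # Datatracker does not have it either, so the draft does not exist.
        raise DraftNotFound(ref)
    # Cache the entry so the next request is served locally.
    index[ref] = fetched
    return fetched
```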

For the Unversioned Reference pattern (given draft-{example-name}), the operating mode is:

We cannot know whether the draft-{example-name}-{draft-number} items we have are actually the newest.
We will have to proxy the request to datatracker.ietf.org every single time.
There may not be any point in caching this because again, we can never know. It would be better to err on the side of accuracy.

strogonoff commented 2 years ago

I’m currently addressing this the following way:

Adding Datatracker as an external source (like DOI)
Changing the behavior of the citation retrieval API and GUI so that, when a citation is requested by document ID, external sources are also checked
- The logic is 1) check internal sources, 2) check external sources, 3) merge citation data from all sources and display the result
- Since (2) may take time:
  - in the web GUI, if an indexed citation was found at step (1), steps (2) and (3) will be deferred and happen client-side after the initial page is rendered
  - in the API, requesters can specify a flag indicating that they are willing to wait for external sources (without this flag, the API works faster but only returns results from indexed sources)
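
A rough sketch of that three-step logic for the API case, under stated assumptions: `retrieve_citation`, `merge_citations`, and the `wait_for_external` flag are hypothetical names, citations are plain dicts, and the merge policy (earlier sources win on conflicting keys) is an assumption, since the thread does not specify one.

```python
from typing import Callable, Dict, List, Optional

Citation = Dict[str, object]
Lookup = Callable[[str], Optional[Citation]]


def merge_citations(sources: List[Citation]) -> Citation:
    """Merge citation dicts; on conflicting keys the earlier source wins (assumed policy)."""
    merged: Citation = {}
    for citation in sources:
        for key, value in citation.items():
            merged.setdefault(key, value)
    return merged


def retrieve_citation(
    doc_id: str,
    internal_lookup: Lookup,
    external_lookups: List[Lookup],
    wait_for_external: bool = False,
) -> Optional[Citation]:
    found: List[Citation] = []
    # 1) Check internal (indexed) sources.
    internal = internal_lookup(doc_id)
    if internal is not None:
        found.append(internal)
    # 2) Check external sources only if the requester agreed to wait.
    if wait_for_external:
        for lookup in external_lookups:
            result = lookup(doc_id)
            if result is not None:
                found.append(result)
    # 3) Merge whatever was found; None means nothing matched.
    return merge_citations(found) if found else None
```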

This means:

If we have a request for an unversioned draft ID, the BibXML service will not immediately find an indexed citation
However, the BibXML service will also call the Datatracker API and get a result, since presumably the Datatracker API supports unversioned drafts
This means that the unversioned I-D details page will be slower to load (and API users would have to pass an extra parameter), but otherwise it’d work

Questions to @ronaldtse:

Does the above make sense?
As we use Relaton as internal citation data model, do we have a way of converting Datatracker output to Relaton? This can be a tiny Python library with minimal API—take a structure as returned by Datatracker, and convert it to Relaton; all self-contained, no need to handle HTTP requests at this point, BibXML will do that. If no one is on it, I can deal with that (in that case I’d appreciate some pointers to Datatracker API spec).

ronaldtse commented 2 years ago

> As we use Relaton as internal citation data model, do we have a way of converting Datatracker output to Relaton?

Yes, I believe @CAMOBAP has already done this in Python:

Relaton to BibXML format (Datatracker uses the BibXML format)
BibXML format to Relaton

> Adding Datatracker as an external source (like DOI)

I don't object to this, but treating Datatracker the same way as DOI/Crossref will eventually be problematic.

DOI/Crossref will 100% flag us for traffic, and in our experience, goes down quite often.
Datatracker is "on our side" and will always be accessible (when it is available).

I wonder if one way to facilitate the check for the "latest draft version" is to have a new Datatracker API that just returns the "latest draft version".

E.g., if I request datatracker.ietf.org/drafts/draft-xxx, I get back which draft version of xxx is the latest (e.g. xxx-NN). Then if the BibXML Service has already cached xxx-NN, we don't need a subsequent fetch, and caching becomes more effective. This will ease the load on Datatracker.
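
A minimal sketch of how such a check could work. The "latest version" endpoint is only a proposal in this thread, so `fetch_latest_version` and `fetch_draft` are hypothetical callables standing in for it and for a full fetch; `cache` stands in for the service's database.

```python
from typing import Callable, Dict, Optional


def get_latest_draft(
    name: str,
    cache: Dict[str, dict],
    fetch_latest_version: Callable[[str], Optional[str]],
    fetch_draft: Callable[[str], Optional[dict]],
) -> dict:
    # Cheap call: ask only for the latest draft number, e.g. "03".
    version = fetch_latest_version(name)
    if version is None:
        raise LookupError(f"unknown draft: draft-{name}")
    ref = f"draft-{name}-{version}"
    # Expensive call only when that version is not cached yet.
    if ref not in cache:
        data = fetch_draft(ref)
        if data is None:
            raise LookupError(f"draft not retrievable: {ref}")
        cache[ref] = data
    return cache[ref]
```

The point of the design is that the unversioned request still consults Datatracker every time, but the full document is fetched at most once per draft number.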

strogonoff commented 2 years ago

> I wonder if one way to facilitate the check for the "latest draft version" is to have a new Datatracker API that just returns the "latest draft version".

What if a published draft version is changed and our index has stale data? I can’t recall whether this can happen.

strogonoff commented 2 years ago

> Datatracker uses the BibXML format

Does it use BibXML format? I checked these two, and results look different from what we have in our BibXML data repositories:

I believe it might be easier to adapt Datatracker’s JSON responses to Relaton format.

ronaldtse commented 2 years ago

@strogonoff you're right! Don't know why I thought Datatracker used the BibXML format.

So... this is to be done using relaton-bib-py then (the Datatracker format => Relaton format conversion)?

strogonoff commented 2 years ago

IMO the path of least friction is for me to implement a quick converter from Datatracker JSON to Relaton as part of the BibXML service itself, and later either split it out into its own package or include it in relaton-bib-py.
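
For illustration only, such a converter could start out as small as the sketch below. Both sides are assumptions: the input keys (`name`, `rev`, `title`, `abstract`) are a guess at what a Datatracker JSON document record contains, and the output is a simplified Relaton-like dict, not the full Relaton schema. The function name `datatracker_to_relaton` is hypothetical.

```python
from typing import Any, Dict


def datatracker_to_relaton(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an (assumed) Datatracker document dict to a simplified Relaton-like dict."""
    name = doc["name"]    # assumed key, e.g. "draft-ietf-foo-bar"
    rev = doc.get("rev")  # assumed key, e.g. "03"
    docid = f"{name}-{rev}" if rev else name
    item: Dict[str, Any] = {
        "docid": [{"id": docid, "type": "Internet-Draft"}],
        "title": [{"content": doc["title"], "format": "text/plain"}],
    }
    # Optional fields are copied only when present and non-empty.
    if doc.get("abstract"):
        item["abstract"] = [{"content": doc["abstract"], "format": "text/plain"}]
    return item
```

Keeping the function free of HTTP handling matches the idea discussed earlier in the thread: the service performs the request and passes the parsed structure in.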

strogonoff commented 2 years ago

This already works for the xml2rfc-style API (see the xml2rfc_compat.fetchers.internet_drafts() logic).

For the main API, Datatracker should be queried by default only if the requested document is not found (and the request specifies the correct doctype of “Internet-Draft”).

For the GUI, this is not implemented yet. The original idea was to augment the client-side part of the item search and item details pages: query the service API client-side and augment the displayed data with new results, if any.

ietf-tools / bibxml-service

Implement new version/dynamic fetching for Internet-Drafts #63